Speech Perception*



Michael Studdert-Kennedy 

Haskins Laboratories, New Haven, Conn. 



"The understanding of speech involves essentially the same problems 
as the production of speech.... The processes. . .have too much in 
common to depend on wholly different mechanisms" (Lashley, 1951:120). 

INTRODUCTION 

We can listen to speech at many levels. We can listen selectively for 
meaning, sentence structure, words, phones, intonation, chatter, or even, at a
distance, Auden's "high, thin, rare, continuous hum of the self-absorbed." This
paper is concerned solely with phonetic perception, the transformation of a
more-or-less continuous acoustic signal into what may be transcribed as a
sequence of discrete, phonetic symbols. The study of speech perception, in this
sense, has in recent years begun to adopt the aims, and often the methods, of
the information-processing models of cognitive psychology which have proved
fruitful in the
study of vision (Neisser, 1967; Haber, 1969; Reed, 1973). The underlying assump- 
tion is that perception has a time-course, during which information in the sen- 
sory array is "transformed, reduced, elaborated" (Neisser, 1967:4) and brought 
into contact with long-term memory (recognized). The experimental aim is to in- 
tervene in this process (either directly or by inference) at various points be- 
tween sensory input and final percept, in order to discover what transformations 
the original information has undergone. The ultimate objective is to describe 
the process in terms specific enough for neurophysiologists to search for neural 
correlates. 

Let us begin by considering how speech perception differs from general audi- 
tory perception. It does so in both stimulus and percept. First, the sounds of 
speech constitute a distinctive class, drawn from the set of sounds that can be 
produced by the human vocal mechanism. They can be described, to an
approximation, as the output of a filter excited by an independent source. The
source is
the flow of air from the lungs, modulated at the glottis to produce a quasi- 
periodic sound, or above the glottis to produce a noisy turbulence. The filter 



*Chapter prepared for Contemporary Issues in Experimental Phonetics, ed. by N. J.
Lass (Springfield, Ill.: C. C. Thomas, in press).

+Also Queens College and the Graduate Center of the City University of New York.

Acknowledgment: I thank Alvin Liberman, Ignatius Mattingly, and Donald
Shankweiler for their valuable comments and criticism, and David Pisoni for much
fruitful conversation and for drawing my attention to the work of Eleanor Rosch.
is the supralaryngeal vocal tract, whose varying configurations give rise to
varying resonances (formants). The resulting sound wave may be displayed as an
oscillogram or, after spectral analysis, as a spectrogram. It is important to
bear in mind that the spectrogram does not display the sensory input, but a
transformation of that input, often presumed to represent the output at an early 
stage of auditory analysis. [For accounts of the speech signal and its mechan- 
isms of production, see Fant, 1960; Stevens and House, 1972; Kent (in Lass, in 
press); Babcock (in Lass, in press).] 
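
The source-filter description above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a model taken from the works cited: the source is a bare impulse train standing in for glottal pulses, the filter is a cascade of two-pole digital resonators of the kind commonly used in formant synthesizers, and every name and number (`resonator`, `synthesize_vowel`, the formant frequencies and bandwidths) is hypothetical.

```python
import math


def resonator(signal, freq_hz, bw_hz, fs):
    """Pass a signal through one two-pole digital resonator (one formant).

    The gain is normalized so that the response at 0 Hz is 1.
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def glottal_pulses(f0_hz, duration_s, fs):
    """Source: an idealized quasi-periodic excitation (one impulse per period)."""
    n = int(round(duration_s * fs))
    period = int(round(fs / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.2, fs=10000):
    """Filter: excite a cascade of formant resonators with the source."""
    wave = glottal_pulses(f0_hz, duration_s, fs)
    for freq_hz, bw_hz in formants:
        wave = resonator(wave, freq_hz, bw_hz, fs)
    return wave


def magnitude_at(signal, freq_hz, fs):
    """Magnitude of a single DFT component, used to inspect the spectrum."""
    re = sum(x * math.cos(2.0 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    return math.hypot(re, im)


# Hypothetical two-formant pattern: (center frequency, bandwidth) in Hz.
wave = synthesize_vowel([(700.0, 60.0), (1200.0, 90.0)])
```

Changing only the formant list changes the vowel-like quality of the output while the source stays fixed, which is the sense in which source and filter are independent in this approximation.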

Here our main concern is to look at functional differences between speech
and nonspeech acoustic structure and perception. Speech does not lie at one end
of an auditory (psychological) continuum which we can approach by closer and
closer acoustic (physical) approximation. The sounds of speech are distinctive. 
They form a set of "natural categories" similar to those described by Rosch
(1973). She studied form and color perception among the Dani, a Stone-Age people
of New Guinea, whose language contains "only two color terms which divide the
color space on the basis of brightness rather than hue" (p. 331), and no words
for the Gestalt "good forms" of square, circle, and equilateral triangle. She
found that her subjects were significantly faster in learning arbitrary names
for the four primary hue points than for other hues, and for the three "good
forms" of Gestalt psychology than for others. She points to the possible
physiological underpinnings of these "natural prototypes." Her work is
reminiscent of a study by House, Stevens, Sandel, and Arnold (1962). They
constructed several ensembles of sounds along an acoustic continuum from clearly
nonspeech to speech.
The time taken by subjects to learn associations between sounds and buttons on a 
box was least for the speech ensemble, and did not decrease with the acoustic 
approximation of the ensembles to speech. In short, a signal is heard as either 
speech or nonspeech, and once heard as speech, elicits characteristic perceptual 
functions that we shall discuss below. 

The second peculiarity of speech perception, as we are viewing it, is in 
perceptual response. The final percept is a phonetic name, and the name (unlike 
those for "natural categories" of form and color) bears a necessary, rather than 
an arbitrary, relation to the signal. In other words, speech sounds "name them- 
selves." Notice that this is not true of the visual counterparts of phonetic 
entities: the forms of the alphabet are arbitrary, and we are not concerned 
that, for example, the same visual symbol, P, stands for /p/ in the Roman alpha- 
bet, for /r/ in the Cyrillic. Nothing comparable occurs in the speech system: 
the acoustic correlates of [p] or [r] can be perceived as nothing other than [p]
or [r]. A central problem for the student of speech perception is to define the
nature of this inevitable percept. 

LEVELS OF PROCESSING 

Implicit in the foregoing is a distinction between auditory and phonetic 
perception. As a basis for future discussion, we will lay out a rough
conceptual model of the perceptual process (cf. Studdert-Kennedy, 1974; also
Day, 1968, 1970). We can conceive the signals of running speech as climbing a
hierarchy through at least these successive transformations: (1) auditory,
(2) phonetic, (3) phonological, (4) lexical, syntactic, and semantic. The levels
must be at least partially successive, to preserve aspects of temporal order in
the
signal. They must also be at least partially parallel, to permit higher deci- 
sions to guide and correct lower decisions [cf. Turvey's (1973) discussion of 
peripheral and central processes in vision]. 
The auditory level is itself a series of processes (Fourcin, 1972). Early
work (Licklider and Miller, 1951) showed that the speech waveform could be
vastly distorted without serious loss of intelligibility. Spectrographic
analysis (Potter, Kopp, and Green, 1947; Joos, 1948) and speech synthesis
(Liberman, 1957) showed that patterns of speech important to its perception lay
not in its waveform, but in its time-varying spectrum as revealed by the
spectrogram. We may imagine, therefore, an early stage of the auditory display,
soon after cochlear analysis, as the neural correlate of a spectrogram. Notice
in Figure 1:
regions of high energy concentration (formants, usually labeled from the bottom 
up as F1, F2, F3); different formant patterns associated with the vowels of read
and book, for example; intervals of silence during stop consonant closure; a
sharp scatter of energy (noise burst) upon release of the voiceless stop in to,
and fainter bursts following release of the voiced stops in began; rapid formant
movements (transitions) as articulators move into and out of vowels; a nasal
formant (between F1 and F2) at the end of began; a broad band of noise associated
with the fricative of she; and finally, regular vertical striations, reflecting
a series of glottal pulses, from which fundamental frequency can be derived. A
later, perhaps cortical, stage of auditory analysis may entail detection of just
such features in the spectrographic display. Whether there are acoustic feature 
analyzers specially tuned to speech is an open question that we consider below. 
In any event, the signal has not yet been transformed into the message, and may 
indeed have passed through the same processes as any other auditory input. 

The phonetic level is abstract in the sense that its output is a set of
properties not inherent in the signal. They derive from the auditory display by 
processes that must be peculiar to humans, since they can only be defined by 
reference to the human vocal mechanism. These properties correspond to the
linguistic entities of distinctive feature (Jakobson, Fant, and Halle, 1963) and
phoneme. For the psychological reality of these units, there is ample evidence, 
discussed below. There is also evidence that extraction of these units from the 
auditory display calls upon specialized decoding mechanisms (Studdert-Kennedy 
and Shankweiler, 1970). In any event, the output from this level is now speech, 
although much variability remains to be resolved. 

Resolution is accomplished at the phonological level, where processes
peculiar to the listener's language are engaged. Here, the listener merges
phonetic variations that have no function in his language, treating, for
example, both the initial segment of [pʰɪt] and the second segment of [spɪt] as
instances of /p/. Here, too, the listener may shift distinctions across
segments, interpreting English vowel length before a final stop, for example, as
a phonetic cue to the voicing value of the stop. In short, this is the level at
which phonetic variability is transformed into phonological system. Of course,
for untrained listeners all of the time, and for phoneticians most of the time,
the distinction between phonetic and phonological levels has little import.
Listeners usually hear speech in terms of the categories of their native
language (e.g., Lotz, Abramson, Gerstman, Ingemann, and Nemser, 1960; Scholes,
1968; Day, 1968, 1969, 1970a, 1970b). However, since they may learn (at some
pain) to make phonetic
distinctions, we must assume that phonetic information is available in the sys- 
tem, though unattended in normal listening. Most of the research to be discussed 
has concerned itself with a single language and has not distinguished between 
phonetic and phonological levels. (For extended discussion of experimental
paradigms that serve to reflect several levels of processing from auditory to
phonological, see Cutting, 1973, in press-a.)
The upper levels of lexical, syntactic, and semantic processing complete
the normal process of speech perception. There is good evidence that outputs
from these levels can affect phonological and phonetic perception. Miller,
Heise, and Lichten (1951), for example, showed that words were more intelligible
in a sentence than in a list. Pollack and Pickett (1963) and Lieberman (1963)
found that words excised from sentences and presented to listeners without
syntactic and semantic context were often not recognized. Several writers (e.g.,
Jones, 1948; Chomsky and Miller, 1963; Chomsky and Halle, 1968) have placed a
heavy load on the syntactic structure and semantic content of an utterance in
their accounts of speech perception. However, while these higher levels may
serve to "clean" the message when phonetic lapse is slight (cf. Warren, 1970;
Warren and Obusek, 1971; Cole, 1973a), and may even be deliberately brought to
bear while conversing with a foreigner in a railway tunnel, their control is not
sufficient to disguise all slips of the tongue (cf. Fromkin, 1971). Unambiguous
perception is possible in spite of context, and, as will be seen, presents suf- 
ficient theoretical problems. Bearing in mind our primary distinction between 
auditory and phonetic levels, we turn now to a brief review of acoustic cues and 
of the problems that emerge for perceptual theory. 

THE ACOUSTIC CUES 

Many of the acoustic cues to the phonetic message have been uncovered over 
the past twenty years by the complementary processes of analysis and synthesis. 
Spectrographic analysis of natural speech suggests likely candidates, such as
formant frequency, formant movement, silent interval, or burst of noise.
Synthesis then permits these "minimal cues" (Liberman, 1957) to be checked for
perceptual validity. Results of this work are described elsewhere (Liberman,
1957; Fant, 1960, 1968; Mattingly, 1968, 1974; Flanagan, 1972; Stevens and
House, 1972). Here, we do no more than summarize its outcome and frame the
problems it
raises for speech perception. 

The problems are those of invariance and segmentation. The speech signal 
carries neither invariant acoustic cues nor isolable segments that reliably
correspond to the invariant segments of linguistic analysis and perception. The
speech signal can certainly be segmented. Fant (1968) and his colleagues have
outlined a procedure for dividing the signal in both frequency and time, and have
developed a terminology to describe its segments. But these do not correspond
to the phonetic segments of distinctive feature or phoneme. There are
exceptions: fricatives and stressed vowels, for example, may present stable and
more-or-less isolable patterns. But, in general, as Fant (1962) has remarked, a
single segment of sound contains information concerning several neighboring
segments of the message, and a single segment of the message may draw upon
several neighboring segments of sound. In short, the sounds of speech are not
physically discrete, like letters of the alphabet, but rather are shingled into
an intricate, continuously changing pattern (Liberman, Cooper, Shankweiler, and
Studdert-Kennedy, 1967).

Whether the source of this shingled pattern is to be found in mechanical 
constraints, neuromuscular inertia, and temporal overlap of successive commands 
to the articulators (Ohman, 1967), or in elegantly controlled, yet variable
responses to fixed articulatory instructions (MacNeilage, 1970), the result is
not only a loss of segmentation, but also a loss of acoustic invariance. The
cues to a given phonetic segment display enormous variability as a function of
phonetic context, stress, and speaking rate (e.g., Kozhevnikov and Chistovich,
1965;
Stevens, House, and Paul, 1966). 
As a simple instance, consider the acoustic structure of a mirror-image
consonant-vowel-consonant (CVC) syllable such as [bæb]. Experiments with
synthetic speech have demonstrated the importance of second and third formant
transitions as cues for distinguishing among labial, alveolar, and velar stops
(Liberman, Delattre, Cooper, and Gerstman, 1954). Here, the two formants rise
rapidly over the first 40 msec or so into the vowel, and then, after a relatively
sustained formant pattern for, say, 200 msec, drop rapidly back to their
starting points. The acoustic cues to initial and final allophones of [b] are
mirror images, and, separated from the syllable, are heard as distinct nonspeech
sounds. Experiments with tone glissandos matching such patterns in duration and
frequency range reveal no psychoacoustic basis for the perceived phonetic
identity (Klatt and Shattuck, 1973).

Similar discrepancies occur as a function of vowel context. Initial formant
transitions in a CV syllable reflect the changing resonances of the vocal
tract as the articulators move from consonant closure, or constriction, into a
more open position for the following vowel. Since vowels are distinguished by
the positions of their first two or three formant centers on the frequency scale
(Delattre, Liberman, Cooper, and Gerstman, 1952; Peterson and Barney, 1952),
consonantal approach varies with vowel: for example, both second and third
formants fall in the syllable [dæ]; the second rises and the third falls in the
syllable [de]. Yet listeners fail to detect these acoustic differences, and
phonetic identity of the initial segments is preserved.

As a final example, consider vowels. Each stressed vowel, spoken in
isolation, has its characteristic set of formant frequencies. However, in
running speech, these values are seldom reached, particularly if speech is rapid
and vowels unstressed (Lindblom, 1963). If vowel portions are excised from
running
speech and presented without their surrounding formant transitions, identifica- 
tions shift (Fujimura and Ochiai, 1963). This suggests (as do the consonantal 
examples given above) that listeners track formants over at least a syllable in 
order to make their phonetic decisions. (For other examples of phonetic
identity in face of acoustic variance, see Shearme and Holmes (1962), Lindblom
(1963), Ohman (1966), Liberman et al. (1967), and Stevens and House (1972).)

A different class of acoustic variability is instanced by interspeaker
variations. Here differences in acoustic quality can be clearly heard, but are
disregarded in phonetic perception. Center frequencies of vowel formants vary
widely among men, women, and children (Peterson and Barney, 1952), with the
result that acoustically identical patterns may be judged phonetically distinct,
while acoustically distinct patterns may be judged phonetically identical.
"Normalization" probably cannot be accomplished by application of a simple scale
factor (Peterson, 1961) because male-female formant ratios are not constant
across the vowel quadrilateral (Fant, 1966).

A favored belief is that listeners judge vowels by reference to other
vowels uttered by the same speaker. This notion originated with Joos (1948) and
was tested by Ladefoged and Broadbent (1957). They demonstrated that the same
synthetic vowel pattern could be judged differently, depending on the formant
pattern of a precursor phrase. Gerstman (1968) developed an algorithm, derived
from the formant frequencies of [i,a,u] for each speaker, that correctly
identifies 97.5 percent of the Peterson and Barney (1952) vowels. And Lieberman
(1973) claims that unless a listener has heard "calibrating signals," such as
the vowels [i,a,u] or the glides [y] and [w], from which to assess the size of a
particular speaker's vocal tract, "it is impossible to assign a particular
acoustic signal into the correct class" (p. 91).
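
The idea of judging a vowel relative to a speaker's own point vowels can be sketched as a simple range rescaling. What follows is a hedged illustration in the spirit of that approach, not a reproduction of Gerstman's (1968) published procedure: the function names are hypothetical, the 0-to-1 scale is an assumption, and the formant values are illustrative round numbers chosen for the example, not measurements.

```python
def speaker_range(point_vowels):
    """Find one speaker's F1 and F2 extremes from the point vowels [i, a, u].

    point_vowels maps a vowel label to an (F1, F2) pair in Hz.
    """
    f1_values = [f1 for f1, _ in point_vowels.values()]
    f2_values = [f2 for _, f2 in point_vowels.values()]
    return (min(f1_values), max(f1_values)), (min(f2_values), max(f2_values))


def normalize(formants, ranges):
    """Rescale an (F1, F2) pair to the speaker's own range (0 = lowest, 1 = highest)."""
    (f1_lo, f1_hi), (f2_lo, f2_hi) = ranges
    f1, f2 = formants
    return ((f1 - f1_lo) / (f1_hi - f1_lo), (f2 - f2_lo) / (f2_hi - f2_lo))


# Illustrative (assumed) formant values in Hz for two hypothetical speakers.
man = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}
child = {"i": (370, 3200), "a": (1030, 1370), "u": (430, 1170)}

man_a = normalize(man["a"], speaker_range(man))
child_a = normalize(child["a"], speaker_range(child))
# In Hz the two [a] tokens differ by several hundred Hz in F1 and F2;
# after rescaling they fall close together in the normalized space.
```

Such a rescaling presupposes that the speaker's extremes are already known, which is exactly the assumption the perceptual evidence discussed next calls into question.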



However, an algorithm is not a perceptual model, and remarkably little is
actually known in this area: there is a dearth of data on how listeners judge
the varied vowel patterns of different speakers. Furthermore, the phenomenon of
normalization is not confined to vowels. Fourcin (1968) demonstrated that a
synthetic "whispered" syllable with a constant formant pattern could be heard as
a token of [d] if preceded by a man's hallo, of [b] if preceded by a child's.
Rand (1971) showed a similar systematic shift, without benefit of precursor,
when formant frequencies of synthetic CV syllables were increased by 20 percent
above the "male" base. Evidently, normalization can be accomplished within a
syllable, presumably from information provided by formant structure and
fundamental frequency (cf. Fujisaki and Nakamura, 1969). This is precisely what
is suggested by recent work of Strange, Verbrugge, and Shankweiler (1974) and
Verbrugge, Strange, and Shankweiler (1974). They find that a speaker's precursor
vowels, whether [i,a,u] or [ɪ,æ,ʌ], do little to reduce listener error in
judging following vowels spoken by a panel of men, women, and children. Far more
effective in reducing error is presentation of the vowel within a consonantal
frame. Of course, formant reference is clearly involved in studies where
consonantal context is held constant (Summerfield and Haggard, 1973). However,
the results again suggest perceptual tracking of an entire syllable, and
emphasize that invariant acoustic segments matching the invariants of perception
are not readily found. [For a recent review of the normalization problem, see
Shankweiler, Strange, and Verbrugge (in press).]

Nonetheless, the search for acoustic invariance has not been abandoned. A
main reason for this is the obvious worth of some form of feature theory in 
linguistic description and, incidentally, in the description of listener behav- 
ior (see next section). Distinctive-feature theorists have always maintained 
that correlates of the features are to be found at every level of the speech 
process — articulatory, acoustic, auditory — (Jakobson and Halle, 1956; Jakobson, 
Fant, and Halle, 1963; Chomsky and Halle, 1968), and a good deal of current
research is directed toward grounding features in acoustics and physiology (cf.
Ladefoged, 1971a, 1971b; Lindblom, 1972).

Before giving examples, we should emphasize the redundancy of the speech 
signal. A given feature may be signaled by several different cues. Studies of 
synthetic speech have tended to emphasize "sufficient" cues and to disregard 
their interaction. Harris (1958) provides an exception, in her study of noise
bands and formant transitions as cues to English fricatives. So, too, do Harris,
Hoffman, Liberman, Delattre, and Cooper (1958) and Hoffman (1958), who examined
the relative weights of second and third formant transitions in the perception 
of English voiced stops. 

Finally, exceptions are also provided by Lisker and Abramson (1964, 1967, 
1970, 1971) and by Abramson and Lisker (1965, 1970; see also Zlatin, 1974) in 
an extensive series of studies of voicing in many languages. Noting that voic- 
ing in initial stops may be cued by explosion energy, degree of aspiration, and 
first formant intensity, they sought a cover variable that would encompass all 
these cues. They found it in voice onset time (VOT), the interval between re- 
lease of stop closure and the onset of laryngeal vibration. Figure 2 displays 
spectrograms of synthetic stops in which VOT is a sufficient cue for the dis- 
tinction between [ba] and [pa]. Notice that VOT is not a simple variable, 
either articulatorily or acoustically: it refers to a temporal relation between 

Figure 2: Spectrograms of synthetic syllables, [ba] and [pa]. The interval
between release and voicing (vertical striations) (VOT) is 10 msec
for [ba], 100 msec for [pa]. During this interval, F1 is absent
and the regions of F2 and F3 are occupied by "aspirated" noise.
[After Lisker and Abramson, with permission.]

two distinct events. In production, it calls for precise timing of a laryngeal
gesture (approximation of the vocal cords) in relation to supralaryngeal release;
in perception, it calls for judgment of a complex acoustic pattern arrayed in
time. Nonetheless, within these limits, VOT offers a relatively invariant 
physical display and a relatively invariant sequence of coordinated articulatory 
gestures that might serve to define a feature, albeit not a feature within the 
generally accepted system (Chomsky and Halle, 1968). [For a full account of the 
underlying rationale, the reader is referred to the publications cited above 
and, for discussions of the approach, to Stevens and Klatt (1974) and Summerfield
and Haggard (1972); see also Haggard, Ambler, and Callow (1970).] 
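
Since VOT is defined as a temporal relation between two events, it can be written down as a difference of two time stamps. The sketch below is only an illustration: the function names are hypothetical, and the 25 msec category boundary is an assumed placeholder for an English-like labial contrast, not a value reported in the studies cited. The two test values are the VOTs given in the Figure 2 caption.

```python
def voice_onset_time_ms(release_ms, voicing_onset_ms):
    """VOT: interval from release of stop closure to onset of laryngeal vibration.

    A negative value means voicing began before the release.
    """
    return voicing_onset_ms - release_ms


def label_stop(vot_ms, boundary_ms=25.0):
    """Assign a voicing category from VOT alone.

    boundary_ms is an assumed, illustrative category boundary.
    """
    return "b" if vot_ms < boundary_ms else "p"


# The two synthetic syllables of Figure 2, with release placed at time zero.
short_lag = voice_onset_time_ms(0.0, 10.0)   # 10 msec, the [ba] token
long_lag = voice_onset_time_ms(0.0, 100.0)   # 100 msec, the [pa] token
labels = (label_stop(short_lag), label_stop(long_lag))
```

The single number hides the complexity noted in the text: measuring either event in a real signal requires decisions about burst, aspiration, and first formant onset that this sketch takes as given.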

A second example of the search for feature invariants is provided by the 
work of Stevens (1967, 1968a, 1968b, 1972a, 1972b, 1973). In a recent paper, 
for example (Stevens, 1973), he approaches an acoustic definition of
[+ Consonantal], describing consonants as displaying "a rapid change in the
acoustic spectrum" (p. 157) in the region of F2, following release (cf. Fant,
1962). He develops this description, emphasizing the entire spectrum rather than
individual formants, into an acoustic account of place features [+ Coronal],
[+ Labial], and [+ Velar], for which he posits "property detectors." The
acoustic description is based on spectrographic analysis and computations from
an idealized
vocal tract model. The model reveals certain "quantal places of articulation 
which are optimal from the point of view of sound generation" (Stevens, 1968a: 
200) since they permit relatively imprecise articulation without serious pertur- 
bation of the signal. Obviously, these tract shapes can be correlated with ar- 
ticulatory gestures to provide the needed feature correlates. 

Finally, less ambitious attempts to discover feature invariants are in- 
stanced by tape-cutting experiments with natural speech, in which consonantal 
portions of a syllable are removed and presented for identification alone or 
with vowels other than the original (Fischer-Jørgensen, 1972; Cole and Scott,
1974). If this approach leads to precise definition of acoustic invariants, it
will have proved valuable. However, if experiments merely demonstrate that 
transposing initial portions of two CV syllables, for example, yields no change 
in perception of initial consonant, we have not advanced. The transposed pat- 
terns remain different both acoustically and, if removed from the speech stream, 
psychoacoustically, and the demonstrated source of invariance is still the 
listener. The ultimate test of all these attempts will be in control of a 
speech synthesizer from a set of invariant articulatory or acoustic feature 
specifications (Mattingly, 1971). 

THE PHONETIC PERCEPT 

Up to this point we have simply assumed the units of speech perception. 
However, research has sporadically puzzled over their definition for the past 25 
years. The puzzle arises, as we have seen, from the mismatch between the
acoustic signal and the abstract entities of linguistic analysis, distinctive
features, and phonemes. Nonetheless, each of these units has been shown to have
psychological reality. Perhaps the most direct evidence comes from studies of
speaking errors. Fromkin (1971) has analyzed many utterances for errors of
metathesis (spoonerism). She finds that speakers may metathesize not only words
and phrases, but syllables (clarinet and viola → clarinola), phonemes (far more →
mar fore), and features (clear blue → glear plue) (cf. Boomer and Laver, 1968;
MacKay, 1970; Cairns, Cairns, and Williams, 1974). Of particular interest is
her observation that speakers may exchange consonant for consonant and vowel for
vowel, but never consonant for vowel. This reflects a distinction in production
between phonetic elements of the syllable that are, as we shall see, repeatedly
distinguished in perception. In any event, errors of metathesis logically
require that the speaker have independent control over the unit of error. And if
these units are independently produced, it is reasonable to believe that they
are independently perceived.
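
The feature-level error in clear blue → glear plue can be spelled out with a toy feature matrix. This is a minimal sketch under stated assumptions: only four stops and two features (place and voicing) are represented, and the table and function names are hypothetical rather than drawn from any feature system cited here.

```python
# Each phone is a column of feature values: (place, voiced).
FEATURES = {
    "p": ("labial", False),
    "b": ("labial", True),
    "k": ("velar", False),
    "g": ("velar", True),
}

# Reverse lookup from a feature bundle back to the phone it specifies.
PHONE_BY_FEATURES = {bundle: phone for phone, bundle in FEATURES.items()}


def exchange_voicing(first, second):
    """Swap the voicing values of two stops, leaving place of articulation intact."""
    place_1, voiced_1 = FEATURES[first]
    place_2, voiced_2 = FEATURES[second]
    return (
        PHONE_BY_FEATURES[(place_1, voiced_2)],
        PHONE_BY_FEATURES[(place_2, voiced_1)],
    )


# Initial stops of "clear" and "blue": [k] is voiceless velar, [b] is voiced labial.
print(exchange_voicing("k", "b"))  # prints ('g', 'p')
```

The error moves one feature value while the rest of each column stays in place, which is what independent control over the feature amounts to.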

Evidence from perceptual studies is not lacking. Errors of subjects
listening to speech through noise (Miller and Nicely, 1955; Mitchell, 1973) or
dichotically (Studdert-Kennedy and Shankweiler, 1970; Studdert-Kennedy,
Shankweiler, and Pisoni, 1972; Blumstein, 1974) are patterned according to some
form of feature system. Scaling studies, in which the experimenter attempts to
determine the psychological space occupied by a set of consonants or vowels,
repeatedly reveal a structure parsimoniously described by feature theory
(Greenberg and Jenkins, 1964; Singh, 1966; Hanson, 1967; Singh and Woods, 1970;
Shepard, 1972). A new paradigm has recently provided further evidence. Goldstein
and Lackner (in press), adapting a technique devised by Warren and Gregory (1958;
also Warren, 1968, and in Lass, in press), played a 200 msec nonsense syllable
over and over (200 times per minute), asking listeners to report what they
heard. After a few repetitions, listeners began to hear different words (verbal
transformation). The new words were systematically related to the originals:
they entailed changes in value of only one or two distinctive features, and
reflected phonological constraints of English as described by distinctive
feature theory. Finally, errors in short-term memory studies also follow a
feature pattern (Sales, Cole, and Haber, 1969; Wickelgren, 1965, 1966). Several
of these studies have used their perceptual data to compare the predictive power
(and so the validity) of different feature systems. Such work is particularly
important if linguistics is to be regarded as a branch of human psychology
(Chomsky, 1972), and if the abstract units of phonology are to be grounded in
human articulatory and perceptual capacities (Ladefoged, 1971a, 1971b;
Liljencrants and Lindblom, 1972; Lindblom, 1972).

The perceptual status of the columns in a feature matrix has proved more
controversial. Functionally, the column (phone) represents the grouping of
distinctive features within a syllable, specifying the domain within which a
particular feature is to apply. We recognize this perceptually in alliteration
(big boy) and in rhyme (bee and see), where two syllables are perceived as
identical at their beginning, but not at their end, or vice versa. Listeners
reveal this function when asked to judge similarities among words. Vitz and
Winkler (1973) found, in fact, that the number of phones shared by a pair of
words was a more satisfactory predictor of their judged similarity than the
number of shared features. In the verbal transformation study described above
(Goldstein and Lackner, in press), transformations were best described in terms
of phones and features rather than syllables and features: consonant transforms
and vowel transforms, for example, were independent, reflecting feature shifts
within, but not across, phones. Finally, several studies (Kozhevnikov and
Chistovich, 1965; Savin and Bever, 1970; Day and Wood, 1972) have shown reaction
time differences in identification of consonants and vowels within the same
syllables. These differences would not occur if the syllable were an unanalyzed
perceptual entity.

Despite such evidence and despite the clear role of phoneme-size phonetic 
segments in speaking and in writing systems, students have been tempted to re- 
gard these segments as "nonperceptual" (Savin and Bever, 1970) or as "fictitious 
units" based on the historical accident of alphabet invention [Warren (in Lass, 
in press)]. Among the arguments for this conclusion seem to be three solid
facts, two (or more) pieces of ambiguous evidence, and one false belief. The
facts are: first, that no phoneme-size segment can be isolated in the acoustic
signal; second, that some phonemes (stop consonants) cannot be spoken in
isolation; third, that we do speak in syllables and that syllables are the
carriers of stress and speech rhythm. The ambiguous evidence comes from reaction
time studies suggesting that syllables, and even higher order units, may be
identified before the elements of which they are composed. Savin and Bever
(1970) and Warren (1971) showed that the reaction time of listeners monitoring a
monosyllabic list for syllables is faster than their reaction time when
monitoring the same list for the initial phoneme of the syllable. Subsequently,
Foss and Swinney (1973) showed that, under similar conditions, listeners
responded more rapidly to words than to their component syllables, while Bever
(1970) revealed that listeners responded more rapidly to three-word sentences
than to their component words. It was left to McNeill and Lindig (1973) to
release us from this "Looking Glass" world, in which the trial precedes the
crime, by demonstrating that reaction time was always fastest to the largest
elements of which a list was composed. In other words, listeners' response is
most rapid at the level of linguistic analysis to which context has directed
their attention.

Finally, the false belief is that invariance and segmentation problems
would disappear if the syllable were an unanalyzed unit of perception. This be-
lief is no better founded than Wickelgren's (1969) attempt to solve the invari-
ance problem by positing context-sensitive allophones, and is open to many of
the same objections. These objections have been well summarized by Halwes and
Jenkins (1971), and we will not review them here. However, it is worth adding
that the syllable has resisted acoustic definition only somewhat less than the
phoneme-size phonetic segment. Its nucleus may be detected by amplitude and
fundamental frequency peak picking (Lea, 1974), and Malmberg (1955) drew atten-
tion to the possible role of formant transitions in defining syllable boundaries,
but no fully satisfactory definition has yet emerged. Furthermore, coarticula-
tion and perceptual context effects across syllables, though less marked than
across phones, still occur. Ohman (1966), for example, found drastic variations
in vowel formant transitions on either side of stop closure, as a function of
the vowel on the opposite side of the closure. And Treon (1970) has demonstrated
contextual effects in perception extending across two to three syllables. In
fact, as Fodor, Bever, and Garrett (1974) hint, an account of syllable perception
may well require the same theoretical apparatus as an account of phone percep-
tion.

Much of the confusion over units of speech perception might be resolved if
the distinctions between signal and message, and among acoustic, phonetic, and
higher levels were strictly maintained. There is wide agreement among writers,
whose views may otherwise diverge, that the basic acoustic unit of speech per-
ception (and production) is of roughly syllabic length [e.g., Liberman, Delattre,
and Cooper, 1952; Liberman, 1957; Kozhevnikov and Chistovich, 1965; Ohman, 1966;
Ladefoged, 1967; Liberman et al., 1967; Savin and Bever, 1970; Massaro, 1972;
Stevens and House, 1972; Cole and Scott, 1973; Kirman, 1973; McNeill and Repp,
1973; Warren (in Lass, in press); Studdert-Kennedy, in press]. This is not to
deny that there are longer stretches of the signal over which the perceptual
apparatus must compute relations, but simply to say that the smallest stretch of
signal on which it goes to work is produced by the articulatory syllabic gesture
(Stetson, 1952). This does not mean [as Massaro (1972), for example, seems to
suppose] that the syllable is the basic linguistic and perceptual unit.

We may clarify by conceptualizing the process of constructing an utterance
from a lexicon of morphemes. The abstract entity of the morpheme is the funda-
mental unit in which semantics, syntax, and phonology converge. Each morpheme
is constructed from phonemes and distinctive features. At this level, the syl-
lable does not exist. But morphemic structure is matched to (and must ultimate-
ly derive from) the articulatory capacities of the speaker. Both universal and
language-specific phonotactic constraints ensure that a morpheme will eventuate
in pronounceable sequences of consonants and vowels. Under the control of a
syntactic system governing their order and prosody, the morphemes pass through
the phonetic transform into a sequence of coarticulated gestures. These ges-
tures give rise to a sequence of acoustic syllables, into which the acoustic
correlates of phoneme and distinctive feature are woven. The listener's task is
to recover the features and their phonemic alignment, and so the morpheme and
meaning. In short, perception entails the analysis of the acoustic syllable, by
means of its acoustic features, into the abstract perceptual structure of fea-
tures and phonemes that characterize the morpheme. We now turn to some theoret-
ical accounts of how this might proceed.

MODELS OF PHONETIC PERCEPTION 

We have no models specified in enough detail for serious test. But a brief 
account of two approaches that have influenced recent research may serve to sum-
marize the discussion up to this point. The two approaches are those of the
Haskins Laboratories investigators and of Stevens and his colleagues at the
Massachusetts Institute of Technology. Both groups are impressed, in varying
degrees, by the invariance and segmentation problem. Both have therefore re-
jected a passive template- or pattern-matching model in favor of an active or
generative model. (For a review, see Cooper, 1972.) 

Liberman et al. (1967), reformulating a theme that had appeared in many
earlier papers from the Haskins group, proposed a "motor theory of speech per-
ception." The crux of their argument was that an articulatory description of
speech is not merely simpler, but is the only description that can rationalize
the temporally scattered and contextually variable patterns of speech. They
argue that phonetic segments undergo, in their passage through the articulatory
system, a process of "encoding." They are restructured acoustically in the syl-
labic merger, so that cues to phonetic identity lose their alignment and are
distributed over the entire syllable (Liberman, 1970). Not all phonetic seg-
ments undergo the same degree of restructuring: there is a hierarchy of encoded-
ness, from the highly encoded stop consonants, through nasals, fricatives,
glides, and semivowels, to the relatively unencoded vowels. Nonetheless, recov-
ery of phonetic segments from the syllable calls for parallel processing of both
consonant and vowel; neither can be decoded without the other. And this demands
a specialized decoding mechanism, in which reference is somehow made to the ar-
ticulatory gestures that gave rise to the encoded syllables.

Liberman et al. (1967) assume, reasonably enough, that "at some level... of
the production system there exist neural signals standing in one-to-one corre-
spondence with the various segments of the language," and that for the phoneme
"the invariant is found far down in the neuromotor system, at the level of the
commands to the muscles" (p. 454). It is important to note that actual motor
engagement is not envisaged. Liberman (1957) has written: "We must assume that
the process is somehow short-circuited -- that is, that the reference to articula-
tory movements and their sensory consequences must somehow occur in the brain
without getting out into the periphery" (p. 122).

A virtue of the model is that it accounts for a fair amount of data and has
generated a steady stream of research. Also, the concept of encoding, though
descriptive rather than explanatory, draws attention to a process at the base of
language analogous to syntactic processes suggested by generative grammar, and
hints at formal similarities in the physiological processes underlying phonetic
and syntactic performance (Mattingly and Liberman, 1969; Liberman, 1970;
Mattingly, 1973, 1974). Conspicuously absent is any account of first-language
acquisition. The child may be presumed to be born with some "knowledge" of
vocal tract physiology and an incipient capacity to interpret the output of an
adult tract in relation to that of its own (Mattingly, 1973), but a detailed
account of the process is lacking.

Stevens (1973) has concerned himself with this problem, and addresses it in 
the most recent version of his analysis-by-synthesis model (Stevens, 1972a; cf.
Stevens, 1960; Stevens and Halle, 1967). The model is far more explicit than
that of the Haskins group. The perceptual process is conceived as beginning
with some form of peripheral spectral analysis, acoustic feature and pitch ex-
traction. Pitch and spectral information, over a stretch of several syllables,
is placed in auditory store. Acoustic feature information undergoes preliminary 
analysis by which a rough matrix of phonetic segments and features is extracted 
and passed to a control system. On occasion, this matrix may provide sufficient 
information for the control (which knows the possible sequences of phonetic seg- 
ments and has access to the phonetic structure of earlier sections of the utter- 
ance) simply to pass the description on to higher levels. If this is not possible, 
the control guesses at a phonetic description on the basis of its inadequate in- 
formation and sends the description to a generative rule system, the same that 
in speaking directs the articulatory mechanism. The rule system generates a 
version of the utterance and passes it to a comparator for comparison with the 
spectral description in temporary auditory store. The comparator computes a 
difference measure and feeds it back to the control. If the "error" is small
enough, the control system accepts its original phonetic description as correct. 
If not, it makes a second guess and the cycle repeats until an adequate match is 
reached. 
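
The guess-generate-compare cycle just described can be sketched in a few lines. This is only an illustration under stated assumptions, not Stevens's implementation: the function name, the squared-difference error measure, the tolerance, and the toy two-value "spectra" are all hypothetical, and the candidate guesses are simply tried in a fixed order rather than being chosen by a preliminary feature analysis.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

Spectrum = Sequence[float]


def analysis_by_synthesis(
    stored_spectrum: Spectrum,
    candidate_descriptions: Iterable[str],
    synthesize: Callable[[str], Spectrum],
    tolerance: float,
) -> Tuple[Optional[str], int]:
    """Return the first guess whose generated spectrum matches the auditory store.

    The second element of the result is the number of guess-compare cycles used.
    """
    cycles = 0
    for guess in candidate_descriptions:   # control proposes a phonetic description
        cycles += 1
        generated = synthesize(guess)      # generative rules produce a version
        # Comparator: squared difference between stored and generated spectra
        # (an assumed error measure; the text does not specify one).
        error = sum((s - g) ** 2 for s, g in zip(stored_spectrum, generated))
        if error <= tolerance:             # "error" small enough: accept the guess
            return guess, cycles
    return None, cycles                    # no adequate match among the candidates


# Hypothetical rule table: each description maps to two made-up spectral values.
TOY_RULES = {"ba": (300.0, 900.0), "da": (300.0, 1700.0), "ga": (300.0, 2300.0)}
stored = (310.0, 1680.0)
best, cycles = analysis_by_synthesis(
    stored, ["ba", "da", "ga"], TOY_RULES.__getitem__, tolerance=1000.0
)
print(best, cycles)
```

With these made-up values the first guess is rejected and the second accepted, which is the sense in which the comparator's feedback drives the cycle.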

This rough account does no justice to the model's elegance and subtlety,
but it may serve to focus attention on several points. First, the solution to
the invariance problem is a more abstract and more carefully specified version
of a motor theory. Second, the model emphasizes the necessity of at least a
preliminary feature analysis, to ensure that the system is not doomed to an in-
finity of bad guesses, and that the child, given a set of innate "property de-
tectors," can latch onto the utterance. At the same time, no account is offered
of how the invariant acoustic properties are transformed into phonetic segments
and features (the process is simply consigned to "a preliminary analysis"), nor
of the precise form that the phonetic description takes. Finally, the model
emphasizes the need for a short-term auditory store. As we shall see, the form
and duration of such a store is currently the focus of a great deal of research.

THE PROCESSING OF CONSONANTS AND VOWELS

Preliminary

To brace ourselves for a fairly prolonged discussion of consonants and 
vowels, let us consider why they are interesting. For theory, the answer is 
that they lie at the base of all phonological systems. All languages are
syllabic, and all languages constrain syllabic structure in terms of consonants
and vowels. If we are to ground phonological theory in human physiology, we
must understand why this path was taken. Lieberman (1970) has argued that pho-
nological features may have been selected through a combination of articulatory
constraints and "best matches" to perceptual capacity. One purpose of current 
research is to understand the nature and basis of the best match between sylla- 
bles, constructed from consonants and vowels, and perceptual capacity. 

For experiment, the interest of consonants and vowels is that they are dif- 
ferent. If all speech sounds were perceived in the same way, we would have no 
means of studying their underlying relations. Just as the biologist could not 
study the genetics of eye-color in Drosophila melanogaster until he had found 
two flies with different eyes, so the student of speech had no means of analyz-
ing syllable perception until he had found portions of the syllable that re-
flected different perceptual processes (cf. Stetson, 1952). Fortunately, the in-
terests of theory and research converge.

Categorical Perception 

Study of sound spectrograms reveals that portions of the acoustic patterns 
for related phonetic segments (segments distinguished from one another by a 
single feature) often lie along an apparent acoustic continuum. For example, 
center frequencies of the first two or three formants of the front vowels /i,I,
e,ae/ form a monotonic series; syllable-initial voice-voiceless pairs /b,p/,
/d,t/, /g,k/ differ systematically in voice onset time; voiced stops /b,d,g/, be-
fore a particular vowel, differ primarily in the extent and direction of their
formant transitions.

To establish the perceptual function of such variations, speech synthesis is
used. Figure 3 sketches a schematic spectrogram of a synthetic series in which 
changes of slope in F2 transition effect perceptual changes from /b/ through /d/ 
to /g/. Asked to identify the dozen or so sounds along such a continuum, lis-
teners divide it into distinct categories. For example, a listener might consis-
tently identify stimuli -6 through -3 of Figure 3 as /b/, stimuli -1 through +3
as /d/, and stimuli +5 through +9 as /g/. In other words, he does not, as might
be expected on psychophysical grounds, hear a series of stimuli gradually chang- 
ing from one phonetic class to another, but rather a series of stimuli, each of 
which (with the exception of one or two boundary stimuli) belongs unambiguously
in a single class. The important point to note is that, although steps along
the continuum are well above nonspeech auditory discrimination threshold, lis- 
teners disregard acoustic differences within a phonetic category, but clearly 
hear equal acoustic differences between categories. 
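
As a small illustration of how a category boundary can be read off an identification function, the sketch below locates the 50 percent crossover by linear interpolation between adjacent stimuli. The 50 percent criterion, the function name, and the labeling proportions are assumptions introduced here for illustration; they are not data from the studies discussed.

```python
from typing import Optional, Sequence


def crossover(
    steps: Sequence[float],
    proportions: Sequence[float],
    criterion: float = 0.5,
) -> Optional[float]:
    """Return the stimulus value at which a falling identification function
    crosses the criterion, interpolating linearly between adjacent stimuli."""
    pairs = list(zip(steps, proportions))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if p0 >= criterion > p1:
            # Fraction of the way from x0 to x1 at which the criterion is reached.
            return x0 + (p0 - criterion) * (x1 - x0) / (p0 - p1)
    return None  # the function never falls through the criterion


# Hypothetical proportions of /b/ responses for six stimuli of a continuum.
steps = [-6, -5, -4, -3, -2, -1]
p_b = [1.0, 1.0, 0.95, 0.9, 0.3, 0.0]
boundary = crossover(steps, p_b)
print(boundary)
```

A steep function of this kind places the boundary between two adjacent stimuli, leaving at most one or two ambiguous tokens.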

To determine whether listeners can, in fact, hear the acoustic differences
belied by their identifications, discrimination tests are carried out, usually
in ABX format. Here, on a given trial, the listener hears three stimuli, sepa-
rated by a second or so of silence: the first (A) is drawn from a point on the
continuum two or three steps removed from the second (B), and the third (X) is a
repetition of either A or B. The listener's task is to say whether the third
stimulus is the same as the first or the second. The typical outcome for a stop
consonant continuum is that listeners hear few more auditory differences than
phonetic categories: they discriminate very well between stimuli drawn from dif-
ferent phonetic categories, and very poorly (somewhat better than chance) between
stimuli drawn from the same category. The resulting function displays peaks at
phonetic boundaries, troughs within phonetic categories. In fact, discrimina-
tive performance can be predicted with fair accuracy from identifications: the
probability that acoustically different syllables are correctly discriminated is
a positive function of the probability that they are differently identified
(Liberman, Harris, Kinney, and Lane, 1961). This close relation between identi-
fication and discrimination has been termed "categorical perception": that is to
say, perception by assignment to category. Figure 4 (left side) illustrates the
phenomenon. Note that, although prediction from identification to discrimination
is good, it is not perfect: listeners can sometimes discriminate between differ-
ent acoustic tokens of the same phonetic type. Note, further, that neither iden-
tification nor discrimination functions display quantal leaps across category
boundaries. This is not a result of data averaging, since the effect is given by
individual subjects. Evidently auditory information about consonants is slight,
but not entirely lacking.

Figure 3: Schematic spectrogram for a series of synthetic stop-vowel syllables
varying only in F2 transition. F2 steady-state, F1 transition, and
steady-state remain constant. As F2 transition changes from -6 to +9,
perception of initial consonant shifts from [b] through [d] to [g].
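
The prediction of discrimination from identification can be made concrete. The sketch below assumes one simple covert-labeling model, which is an assumption introduced here and not a formula quoted from the studies cited: the listener labels A, B, and X independently, answers from the labels when A and B receive different labels and X matches one of them, and otherwise guesses. Under these assumptions, and with X equally often A or B, the expected proportion correct works out to 0.5 + 0.25 times the summed squared differences between the two stimuli's labeling probabilities. The function name and the example probabilities are hypothetical.

```python
from typing import Sequence


def predicted_abx(p_a: Sequence[float], p_b: Sequence[float]) -> float:
    """Predicted proportion correct in ABX under the covert-labeling assumption.

    p_a[i] and p_b[i] are the probabilities that stimuli A and B, respectively,
    are identified as phonetic category i.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both stimuli need a probability for every category")
    return 0.5 + 0.25 * sum((a - b) ** 2 for a, b in zip(p_a, p_b))


# Hypothetical labeling probabilities over the categories /b/, /d/, /g/.
within_category = predicted_abx([0.95, 0.05, 0.0], [0.90, 0.10, 0.0])
across_boundary = predicted_abx([0.90, 0.10, 0.0], [0.10, 0.90, 0.0])
print(within_category, across_boundary)
```

Stimuli that are always labeled alike yield chance performance (0.5), and stimuli that are always labeled differently yield perfect performance, so the predicted function peaks at category boundaries. Obtained scores that exceed the prediction within a category are what suggest some residual auditory information.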

We may now contrast categorical perception of stop consonants with "continu-
ous perception" of vowels. Figure 4 (right side) illustrates the effect. There
are two points to note. First, the vowel identification function is not as
clear-cut as the consonant. Vowels, particularly those close to a phonetic
boundary, are subject to context effects: for example, a token close to the
/i-I/ boundary will tend to be heard, by contrast, as /i/ if preceded by a
clear /I/, as /I/ if preceded by a clear /i/. The second point to note is that
vowel discrimination is high across the entire continuum. Phonetic class is not
totally irrelevant (there is a peak in the discrimination function at the cate-
gory boundary), but both within and between categories listeners discriminate
many more differences than they identify. Their perception is said to be "con-
tinuous." [For fuller discussion, see Studdert-Kennedy, Liberman, Harris, and
Cooper (1970a) and Pisoni (1971).]

Figure 4: Average identification functions for synthetic series of stop conso-
nants and vowels (top). Average one-step (middle) and two-step
(bottom) predicted and obtained ABX discrimination functions for the
same series. [After Pisoni (1971), with permission of the author.]

Continuous perception is typical not only of vowels, but also of many non-
speech psychophysical continua along which we can discriminate more steps than
we can identify (Miller, 1956). This fact has been taken as evidence both that
categorical perception is peculiar to speech, and that stop consonants and vow-
els engage fundamentally different perceptual processes (Liberman et al., 1967;
Studdert-Kennedy et al., 1970a). In fact, an early account of the phenomenon
invoked a motor theory of speech perception (Liberman et al., 1967). As we have
seen, there are independent grounds for hypothesizing that speech is perceived
by reference to its articulatory origin. Here seemed to be additional evidence:
the discrete articulatory gestures of stop consonants yielded discrete percep-
tual categories; the more variable gestures of vowels, more variable categories.
But this account has several weaknesses, and recent work has largely eroded it.
For one thing, we now know that categorical perception is not confined to speech
(Locke and Keller, 1973; Miller, Pastore, Wier, Kelly, and Dooling, 1974;
Cutting and Rosner, in press).

However, this discovery in no way diminishes the importance of the phenome-
non, as will become clear in the following sections. Here, we merely note two
facts. First, the acoustic patterns distributed along a speech continuum are
not arbitrary. They have been selected from the range of patterns that the ar-
ticulatory apparatus can produce and that the auditory system can analyze. The
categories are therefore natural, in the sense that they reflect physiological
constraints on both production and perception. As Stevens (1972b) has pointed
out, our task is to define the joint auditory and articulatory origin of phonet-
ic categories.

Second, categorical perception reflects a functionally important property
of certain speech sounds. The initial sound of /da/, for example, is difficult,
if not impossible, to hear: the sound escapes us and we perceive the event, al-
most instantly, as phonetic. Rapid sensory decay and transfer into a nonsensory
code is probably crucial to an efficient linguistic signaling system. Study of
categorical perception has, in fact, revealed functional differences between
stop consonants and vowels that are central to the syllabic structure of speech.
At the same time it has provided basic evidence for the distinction between audi-
tory and phonetic levels of processing.

In the following sections we consider two main aspects of categorical per-
ception: first, the division of a physical continuum into sharply defined cate-
gories, and the assignment of names to the categories; second, listeners'
apparent inability to discriminate among members of a category.

The Bases of Phonetic Categories

Phonetic categories do not arise from simple discriminative training, as
proposed by Lane (1965). Subjects may certainly learn to divide a sensory con-
tinuum into clear-cut categories, with a resultant small peak in the discrimina-
tion function at the category boundary. But discrimination within categories
remains high (Parks, Wall, and Bastian, 1969; Studdert-Kennedy et al., 1970a;
Pisoni, 1971): training may increase, but not obliterate, discriminative capac-
ity. Furthermore, the learned boundary is likely to be unstable. The process
is familiar to the psychophysicist. For example, if we present a subject with a
series of weights and ask him to judge each weight as either heavy or light, he
will, with a minimum of practice, divide the range cleanly around its balance
point (see Woodworth and Schlosberg, 1954:Ch. 8). However, the boundary between
heavy and light can be readily shifted by a change in experimental procedure.
If an extreme token is presented for judgment with a probability several times
that of other stimuli along the continuum, it comes to serve as an anchor with
which other stimuli contrast: the result is a shift in category boundary toward
the anchoring stimulus. Pisoni and Sawusch (in press) have shown that such
shifts occur for a series of tones, differing in intensity, and for vowels, but
not for stop consonants distributed along a voice-onset time continuum. They
suggest that response criteria for voicing categories are mediated by internal
rather than external references. By thus reframing the observation that stop
consonant categories are not subject to context effects, they invite us to con-
sider the nature of the internal reference.

Such a reference must be some distinctive perceptual quality shared by all 
members, and by no nonmembers, of a category. There is, of course, no reason to
suppose that distinctive perceptual qualities are confined to speech continua.
They will emerge from any physical continuum for which sensitivity is low within
restricted regions and, by corollary, high between these regions. However,
while the distinctive perceptual quality of a nonspeech event (such as a click,
a musical note, or a flash of light) has the character of its sensory mode, the
distinctive perceptual quality of a speech sound is phonetic. It is into a pho- 
netic code that speech sounds are rapidly and automatically transferred for stor- 
age and recall. 

With this in mind, we turn to several studies of nonspeech continua. We
begin with Cutting and Rosner (in press), who determined an auditory boundary
between rapid and slow stimulus onsets. Variations in stimulus onset, or rise
time, are known to contribute to the affricate/fricative distinction, /tʃa/ versus
/ʃa/ (Gerstman, 1957). The authors varied rise time from 0 to 80 msec for saw-
tooth wave trains, generated by a Moog synthesizer, and for synthetic affricate/
fricatives. The rapid-onset sawtooth waves sounded like a plucked guitar string,
the slow-onset waves like a bowed string. Cutting and Rosner presented their two
classes of stimuli for identification (pluck - bow, /tʃa/ - /ʃa/) and for ABX
discrimination. Both speech and nonspeech yielded category boundaries at a 40-50
msec rise time, with appropriate peaks and troughs in the discrimination func-
tions.

A second instance of nonspeech categorical perception is provided by Miller
et al. (1974). These investigators constructed a rough nonspeech analog of the
voice-onset time continuum. They varied the relative onset times of bursts of
noise and periodic buzz, over a range of noise-leads from -10 to +80 msec, and
presented them to subjects for labeling and discrimination. Listeners divided
the continuum around an average noise-lead of approximately 16 msec, displaying
clear discrimination troughs within no noise-lead and noise-lead categories, and
a discrimination peak at the category boundary. The boundary value agrees re-
markably well with that reported by Abramson and Lisker (1970) for the English
labial VOT continuum, though not with the systematically longer perceptual bound-
aries associated with English apical and velar VOT continua (Lisker and Abramson,
1970). The authors conclude that the categories of their experiment (and,
presumably, of at least the English labial VOT continuum) lie on either side of
a "difference limen" for duration of the leading noise. While possibly correct,
their conclusion places a misleading emphasis on the boundary between categories
rather than on the categories themselves.

The emphasis is reversed in a recent study of Stevens and Klatt (1974).
Following Liberman, Delattre, and Cooper (1958), they examined auditory discrim-
ination of two acoustic variables along the stop consonant voice-voiceless con-
tinuum: delay in formant onset and presence/absence of F1 transition. For
their first experiment they constructed a nonspeech analog of plosive release
and following vowel: a 5 msec burst of noise separated from a vowel-like buzz
by between 0 and 40 msec of silence. Listeners' "threshold" for detection of
silence between noise and buzz was approximately 20 msec, a close match with the
value for detection of noise lead found by Miller et al. (1974). Stevens and
Klatt (1974) imply that the unaspirated/aspirated stop consonant perceptual
boundary in the 20-40 msec VOT range may represent "a characteristic of the audi-
tory processing of acoustic stimuli independent of whether the stimuli are speech
or nonspeech" (p. 654).

We will not pursue the details of their second experiment. However, they
were able to confirm the contribution of a detectable F1 transition to the voice-
voiceless distinction. Furthermore, by hewing to the articulated speech signal
and by focusing on acoustic properties within categories rather than on acoustic
differences between them, Stevens and Klatt were able to offer a fully plausible
account of systematic increases in the voice-voiceless perceptual boundary that
are associated with shifts from labial to apical to velar stop consonants (Lisker
and Abramson, 1970; Abramson and Lisker, 1973).

If the argument of the last few pages has given the impression that auditory
boundaries between phonetic categories are readily determined, the impression
must be dispelled. The criterion for such boundaries is that they be demon-
strated in a nonspeech analog, a feat that has proved peculiarly difficult for
the voiced stop consonants. The typical outcome of studies in which formant
patterns controlling consonant assignments are removed from context and presented
for discrimination is that they are perceived continuously (e.g., Mattingly,
Liberman, Syrdal, and Halwes, 1971). A striking instance is provided by the work
of Popper (1972). He manipulated F2 transitions within a three-formant pattern
(cf. Figure 3) to yield a synthetic series from /ab/ to /ad/. He then measured
energy passed by a 300 Hz band-width filter, centered around the F2 steady-state
frequency, and noted a sharp drop at the /b-d/ boundary both for isolated F2 and
for the full formant pattern. However, subjects evinced the expected discrimina-
tion peak only for the full pattern: the isolated F2, despite its acoustic dis-
continuity, was continuously perceived.

In short, no simple notion of fixed regions of auditory sensitivity serves
to account for categorical division even of the /ba,da,ga/ continuum, let alone
for perceptual invariance across phonetic contexts, for the normalizing shifts in
category boundary associated with speaker variation (cf. Fourcin, 1968; Rand,
1971), or for cross-language differences in boundary placement. The problem is
not confined to articulatory place distinctions. Consider, for example, the fact
that Spanish speakers typically yield a somewhat shorter labial VOT boundary than
do English (Lisker and Abramson, 1964) and that their perceptual boundary shows a
corresponding reduction (Lisker and Abramson, 1970). We can hardly account for
the perceptual shift by appeal to an inherently sharp threshold. Precise
category position along a continuum is clearly a function of linguistic experi-
ence (see also Stevens, Liberman, Studdert-Kennedy, and Ohman, 1969). Popper
(1972) proposes, in fact, that "people who speak different languages may tune
their auditory systems differently" (p. 218). Differential "tuning" could re-
sult from cross-language differences in selective attention to aspects of the
signal, and in criterion levels for particular phonetic decisions. Given the
close match between perception and production (Stevens et al., 1969; Abramson
and Lisker, 1970; Lisker and Abramson, 1970), it seems plausible that such dif-
ferences should arise from complex interplay between speaking and listening dur-
ing language acquisition (see below, From Acoustic Feature to Phonetic Percept).

The notion of "tuning" presupposes the existence of acoustic properties to
which the auditory system may be attuned. The first steps toward definition of
these properties have been taken by Stevens (see especially 1972b, 1973). As
earlier remarked, Stevens has used spectrographic analysis and computations from
an idealized vocal tract model to describe possible acoustic correlates of cer-
tain phonetic features. He finds, for example, that the spectral patterns
associated with continuous changes in place of articulatory constriction along
the vocal tract do not themselves change continuously. Rather, there are broad 
plateaux, within which changes in point of constriction have little acoustic 
effect, bounded by abrupt acoustic discontinuities. These acoustic plateaux 
tend to correlate with places of articulation in many languages. In short, 
Stevens is developing the preliminaries to a systematic acoustic account of pho- 
netic categories and their boundaries. His work is important for its emphasis 
on the origin of phonetic categories in the peculiar properties of the human vo- 
cal tract. Furthermore, as will be seen below, his approach meshes neatly with 
recent work on auditory feature analyzing systems as the bases of phonetic cate- 
gories. 

Auditory and Phonetic Processes in Categorical Perception 

We turn now to the second main aspect of categorical perception (listeners'
failure to discriminate among members of a category) and to the contrast between
continuously perceived vowels and categorically perceived stop consonants. A
long series of experiments over the past few years has shown that listeners'
difficulty in discriminating among members of a category is largely due to the
low energy transience of the acoustic signal on the basis of which phonetic cate-
gories are assigned. Lane (1965) pointed to the greater duration and intensity
of the vowels and showed that they were more categorically perceived if they
were degraded by being presented in noise. Stevens (1968b) remarked the brief,
transient nature of stop consonant acoustic cues, and showed, as did Sachs
(1969), that vowels were more categorically perceived if their duration and
acoustic stability were reduced by placing them in CVC syllables.

The role of auditory memory, implicit in the work just cited, was made
explicit by Fujisaki and Kawashima (1969, 1970) in a model of the decision
process during the ABX trial. If a listener assigns A and B to different
phonetic categories (i.e., if A and B lie on opposite sides of a phonetic
boundary), his only task is to determine whether X belongs to the same category
as A or as B: his performance is then good and a discrimination peak appears in
the function for both consonants and vowels. However, if a listener assigns A
and B to the same phonetic category, he is forced to compare X with his auditory
memory of A and B: his performance is then slightly reduced for vowels, for
which auditory memory is presumed to be relatively strong, but sharply reduced
for consonants, for which auditory memory is presumed to be weak. Evidence for
the operation of such a two-step process within phonetic categories in man, but
not in monkey, has recently been reported by Sinnott (1974).
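
The two-step account lends itself to a small numerical sketch. The Python below
is an illustration only, not Fujisaki and Kawashima's own formulation: the
function name abx_p_correct, the parameter trace_strength (the probability that
the auditory trace supports a correct comparison), the error-free
between-category score, and the numeric stimulus positions are all assumptions
introduced here.

```python
def abx_p_correct(a, b, boundary, trace_strength):
    """Predicted proportion correct on one ABX trial (hypothetical sketch).

    a, b           -- positions of stimuli A and B on an acoustic continuum
    boundary       -- position of the phonetic boundary on that continuum
    trace_strength -- assumed probability (0..1) that auditory memory of A
                      and B supports a correct comparison with X
    """
    same_category = (a < boundary) == (b < boundary)
    if not same_category:
        # Step 1: phonetic labels differ, so labelling X settles the trial.
        # Idealized here as error-free.
        return 1.0
    # Step 2: labels agree, so the listener falls back on auditory memory;
    # when the trace fails, the listener guesses (chance = 0.5).
    return trace_strength + (1.0 - trace_strength) * 0.5


# Assumed illustrative values: a strong trace for vowels, a weak one for stops.
VOWEL_TRACE, STOP_TRACE = 0.8, 0.2

between = abx_p_correct(3, 5, boundary=4, trace_strength=STOP_TRACE)
within_vowel = abx_p_correct(1, 3, boundary=4, trace_strength=VOWEL_TRACE)
within_stop = abx_p_correct(1, 3, boundary=4, trace_strength=STOP_TRACE)
print(between, round(within_vowel, 2), round(within_stop, 2))
```

With these assumed values the sketch reproduces the qualitative pattern
described above: performance peaks across the boundary, drops slightly within a
vowel category, and drops sharply within a stop category.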

Before we proceed, let us spell out some distinctions between auditory and
phonetic memory stores. The auditory store, or trace, is usually assumed to be
rather like an echo: a faint simulacrum, if not of the waveform, at least of
its neural correlates at an early stage of processing. Like an echo, the trace
is an analog of its original, decays rapidly, and may be displaced if another
sound arrives to interfere before decay is complete. The phonetic store, on the
other hand, is a set of discrete features. Its decay is a good deal slower, and
interference can only be accomplished by another phonetic entity with similar
phonetic features.

With this in mind, we turn to several experiments by Pisoni (1971, 1973a,
1973b) in which he tested and supported Fujisaki and Kawashima's hypothesis
concerning auditory memory for consonants and vowels. In the first (Pisoni,
1973a) he varied the A-to-X delay interval from zero to two seconds in an AX
same-different task for vowel and stop consonant continua. Between-category
performance (presumably based on phonetic store) was high and independent of
delay interval for both consonants and vowels; within-category performance
(presumably based on auditory store) was low and independent of delay interval
for consonants, but for vowels was high and declined systematically as delay
interval increased. In subsequent experiments, Pisoni (1973b) demonstrated that
the degree of categorical (or continuous) perception of vowels can be
manipulated by the memory demands of the discrimination paradigm and by the
amount of interference from neighboring stimuli (Glanzman and Pisoni, 1973).

Changing tack, Pisoni and Lazarus (1974) sought methods of increasing
apparent auditory memory for stop consonants. This is more difficult, but by a
particular combination and sequence of experimental conditions, they were able
to demonstrate improved within-category discrimination on a voice-voiceless
continuum. The same continuum (/ba-pa/) also elicited reaction time differences
in a pair-matching task (Pisoni and Tash, 1974; cf. Posner, Boies, Eichelman,
and Taylor, 1969). Here, listeners were asked to respond "same" or "different"
to pairs of stimuli drawn from the continuum. "Same" reaction times were faster
for identical pairs than for acoustically distinct pairs drawn from the same
phonetic category; "different" reaction times decreased as acoustic differences
between pairs from different categories increased. This last result recalls
Barclay's (1972) finding that listeners can correctly and reliably judge
acoustic variants of /d/, drawn from a synthetic continuum, as more similar to
/b/ or /g/. If we add these studies to our earlier observation that listeners
always display a margin of within-category discrimination for consonants, and
that discrimination functions do not display a quantal leap between categories,
we must conclude that the auditory system does retain at least some trace of
consonantal passage. At the same time, there is little question that this trace
is fainter than that for vowels.

The conclusion of all these studies is pointed up by the work of Raphael
(1972). He studied voice-voiceless VC continua, manipulating initial vowel
duration as the acoustic cue to voicing of the final stop. Here, where the
perceptual object was consonantal, but the acoustic cue vocalic, perception was
continuous. In short, consonants and vowels are distinguished in the experiments
we have been considering, not by their phonetic class or the processes of
assignment to that class, but by their acoustic characteristics and by the
duration of their auditory stores. If the longer store of the vowels is
experimentally reduced, their membership in the natural class of segmental
phonetic entities is revealed by their categorical perception.

Stages of Auditory Memory 

Several independent lines of research, drawing on different experimental
paradigms, have recently begun to converge on perceptual and memorial processes
below the level of phonetic classification. Experimenters often share neither
terminology nor theoretical framework, but we can discern two, not entirely
overlapping, lines of division in the perceptual process. The first divides
short-term memory into a brief store lasting some hundreds of milliseconds, and
a longer store lasting several seconds. The second divides peripheral from
central processes; this is important, but we will not consider it in detail
here, since the cut cannot be as surely made in audition as in vision (due to
incomplete decussation of auditory pathways), and most of the processes to be
discussed are certainly central.

Short-Term Auditory Stores 

Store I. As a step toward further analysis of auditory memory for speech,
consider the concept of parallel processing. Liberman et al. (1967) used this
term to describe the decoding of a CV syllable, in which acoustic correlates of
consonant and vowel are distributed over an entire syllable (Liberman, 1970).
Obviously, the process requires a store at least as long as the syllable to
register auditory information, and presumably somewhat longer to permit transfer
into phonetic code.

Direct evidence of this type of parallel processing comes from several
sources. Liberman et al. (1952) showed that the phonetic interpretation of a
stop release burst varied with its following vowel, and concluded that we
perceive speech over stretches of roughly syllabic length (cf. Schatz, 1954).
Lindblom and Studdert-Kennedy (1967) demonstrated that the phonetic boundary
for a series of synthetic vowels shifted as a function of the slope and
direction of initial and final formant transitions: listeners judged vowels in
relation to their surrounding consonantal frames (cf. Fujimura and Ochiai, 1963;
Strange et al., 1974). More recently, Pisoni and Tash (1974) have studied
reaction time to CV syllables: they called for same-different judgments on
vowels or consonants of syllable pairs in which nontarget portions of the
syllables were also either the same or different. Whether comparing consonants
or vowels, listeners were consistently faster when target and nontarget portions
of the syllable were redundant (i.e., both same, or both different). In other
words, information from an entire syllable contributed to listeners' decisions
concerning "segments" of the syllable. In a related study by Wood and Day (in
press), listeners identified either the vowel or the consonant of synthetic CV
syllables, /ba, da, bæ, dæ/. If all test items were identical on the nontarget
dimension (i.e., if all had the same vowel on a consonant test, or all the same
consonant on a vowel test), subjects' reaction times were significantly faster
than if both target and nontarget dimensions varied. In the latter case, the
unattended vowel (consonant) retarded listeners' decisions on the attended
consonant (vowel). In short, we have a variety of evidence that, for at least
some syllables, consonant and vowel recognition are interdependent, parallel
processes, requiring a short-term auditory store of at least syllabic duration.

Massaro (1972) made the functional distinction between such a "perceptual
auditory image" and a longer "synthesized" auditory store; he initiated attempts
to estimate duration of the "image" by backward masking studies. First
discovered in visual experiments (Werner, 1935), the paradigm takes advantage of
the fact that perception of a stimulus may be blocked if a second stimulus is
presented some hundreds of milliseconds later; it has been used to good effect
in vision to separate and describe peripheral and central processes (Turvey,
1973). However, the belief that the critical interstimulus interval (ISI), at
which the first stimulus is freed from interference by the second, may be taken
as an estimate of the duration of primary auditory display (Massaro, 1972) is
difficult to sustain, and application of the technique to the study of speech
perception has proved problematic for several reasons.

To begin with, auditory information is displayed over time, so that
perception of a target CV syllable of natural duration (say, 200-300 msec) can
be interrupted only by a masking syllable that begins before the first syllable
is complete. Temporal relations between syllables must then be expressed in
terms of stimulus onset asynchrony (SOA) rather than in terms of ISI, and the
effectiveness of the mask is reduced because it is itself masked by the first
syllable (forward masking). For example, Studdert-Kennedy, Shankweiler, and
Schulman (1970b) found that the first syllable was completely freed from masking
by the second at a SOA of 50 msec, certainly an underestimate of display time,
since it is no more than the duration of the critical consonant information in
the formant transitions of the target CV syllable.
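
The relation between the two measures is simple arithmetic, sketched below in
Python. SOA is measured from target onset to mask onset and ISI from target
offset to mask onset, so ISI equals SOA minus target duration; a negative value
means the mask overlaps the target. The function name and the 250 msec target
duration are illustrative assumptions (the text gives only a range of 200-300
msec).

```python
def isi_from_soa(soa_ms, target_duration_ms):
    """Interstimulus interval implied by a stimulus onset asynchrony.

    SOA runs from target onset to mask onset; ISI runs from target offset
    to mask onset. A negative ISI means the mask starts before the target
    has ended, i.e. the two syllables overlap in time.
    """
    return soa_ms - target_duration_ms


# Assumed example: a 250 msec target syllable and the 50 msec SOA
# mentioned in the text.
isi = isi_from_soa(soa_ms=50, target_duration_ms=250)
overlap_ms = max(0, -isi)
print(isi, overlap_ms)
```

For syllables of natural duration, any mask that arrives early enough to
interrupt the target yields a negative ISI, which is why SOA is the more useful
measure here.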

There are two solutions to this impasse: make the syllables unnaturally
short, or present target and mask to opposite ears (dichotically), thus evading
peripheral masking of the second syllable. Several investigators (Massaro,
1972; Pisoni, 1972; Dorman, Kewley-Port, Brady-Wood, and Turvey, 1973) have
attempted the first solution. Results are difficult to interpret because both
the degree of masking and the critical ISI for release from masking vary with
target (consonant or vowel), size and range (acoustic or phonetic) of target
set, target and mask energy, relations between target and mask structure
(acoustic or phonetic), and individual listeners, many of whom show no masking
whatever even for brief (15.5 msec) vowels (Dorman et al., 1973). Where masking
could be obtained, the shortest critical ISI observed in these studies (80 msec)
was for 40 msec steady-state vowels, and the longest (250 msec) for 40 msec CV
syllables (Pisoni, 1972). Note, incidentally, that complete absence of masking
has been observed only with vowels, and just as categorical perception of vowels
can be induced by degrading them with noise, so too can their masking (Dorman,
Kewley-Port, Brady, and Turvey, 1974).

In any event, these variable results do not encourage one to believe that
critical ISI is measuring the fixed duration of auditory display. And the case
is no better when we turn to dichotic masking paradigms. Pisoni and McNabb
(1974), for example, observed a critical SOA for release from dichotic backward
masking of between 20 and 150 msec, depending upon target and mask vowel
relations. A somewhat longer estimate of 200-250 msec can be extrapolated from
the data of Studdert-Kennedy et al. (1970b). A narrower estimate comes from
McNeill and Repp (1973b). They studied forward masking of dichotically presented
CV syllables, determining the SOA necessary for features of the leading syllable
to have no further effect on errors in the lagging syllable, and so presumably
to have passed out of the phonetic processor. Their estimate of 80-120 msec may
be more realistic for running speech than others, since their procedure
eliminated a component present in all previous studies, namely, time taken to
prepare a response, a period during which effective interruption may still occur
(Repp, 1973).

However, it is more likely that the entire endeavor is misguided. It seems
intuitively plausible that syllable processing time is not constant, but varies,
under automatic attentional control, with speaking rate and other factors. The
studies reviewed are simply measuring time required for release from masking
under a variety of more or less adverse conditions. This is certainly not
without interest, particularly if we can show it to be a function of
well-specified target-mask relations. But we shall then be turning attention
away from the notion of a primary auditory store, and toward the more important
question of what acoustic dimensions are extracted in the very earliest stage of
processing, and how they interact to determine the phonetic percept.

Store II. Nonetheless, some form of auditory store is clearly necessary.
We would otherwise be unable to interpret the prosody of running speech, and
there is ample experimental evidence of cross-syllabic auditory interaction
(Hadding-Koch and Studdert-Kennedy, 1964; Studdert-Kennedy and Hadding, 1973;
Atkinson, 1973). Detailed analysis of this longer store, perhaps lasting several
seconds, was made possible by the work of Crowder and Morton (1969; see also
Crowder, 1971a, 1971b, 1972, 1973). They were the first experimenters to
undertake a systematic account of what they termed "precategorical acoustic
storage" (PAS).

Evidence for the store comes from studies of immediate, ordered recall of
span-length digit lists. Typically, error probability increases from beginning
to end of list, with some slight drop on terminal items (recency effect). The
terminal drop is significantly increased if the list is presented by ear rather
than by eye (modality effect). Crowder and Morton (1969) argue that these two
effects reflect the operation of distinct visual and auditory stores for
precategorical (prelinguistic) information, and of an auditory store that
persists longer than the visual. Support comes from demonstrations that the
recency effect is significantly reduced, or abolished, if subjects are required
to recall the list by speaking rather than by writing (Crowder, 1971a), or if an
auditory list is followed by a redundant, spoken suffix (such as the word zero),
as a signal for the subject to begin recall (suffix effect). That the suffix
interferes with auditory, rather than linguistic, store is argued by the facts
that the effect (1) does not occur if the suffix is a tone or burst of noise;
(2) is unaffected if the spoken suffix is played backward; (3) is unaffected by
degree of semantic similarity between suffix and list; (4) is reduced if suffix
and list are spoken in different voices; and (5) is reduced if suffix and list
are presented to opposite ears.

Of particular interest in the present context is that all three effects
(modality, recency, suffix) are observed for CV lists, of which members differ
in vowel alone, or in both vowel and consonant (spoken letter names), but not
for voiced stop consonant CV or VC lists, of which members differ only in the
consonant (cf. Cole, 1973b). Crowder (1971a:595) concludes that "vowels receive
some form of representation in PAS while voiced stop consonants receive none."
Liberman, Mattingly, and Turvey (1972:329) argue further that phonetic
classification "strips away all auditory information" from stop consonants.

However, this last claim is unlikely to be true. First, there is no good
reason why the process of categorization should affect vowels and consonants
differently. Second, we have a variety of evidence that listeners retain at
least some auditory trace of stop consonants (see previous section). Third,
consonant and vowel differences in PAS can be reduced by appropriate
manipulation of the signal array (Darwin and Baddeley, 1974). These
investigators demonstrated a recency effect for tokens of a stop CV, /ga/, and
two highly discriminable CV syllables in which the consonantal portion is of
longer duration, /ʃa/, /ma/. They also demonstrated that the recency effect for
vowels can be eliminated if the vowels are both very short (30 msec of a 60 msec
CV syllable) and close neighbors on an F1-F2 plot. They conclude that "the
consonant-vowel distinction is largely irrelevant" (p. 48) and that items in PAS
cannot be reliably accessed if, like /ba, da, ga/ or /ɪ, ɛ, æ/, they are
acoustically similar. The effect of acoustic similarity is, of course, to
confound auditory memory. As we shall see shortly (The Acoustic Syllable, below)
and, as Darwin and Baddeley (1974) themselves argue, it is to the more general
concept of auditory memory that we must have recourse, if we are to understand
the full range of experiments in which consonant-vowel differences have been
demonstrated.

We turn now to the duration of PAS and the mechanisms underlying its
reflection in behavior. Notice, first, that if an eight-item list is presented
at a rate of two per sec and is recalled at roughly the same rate, time between
presentation and recall will be roughly equal for all items. Therefore, the
recency effect cannot be attributed to differential decay across the list, but
is due rather to the absence of "overwriting" or interference from succeeding
items. Second, since the degree of interference (i.e., probability of recall
error) decreases as the time between items increases, and since the suffix
effect virtually disappears if the interval between the last item and suffix is
increased to 2 sec, we are faced with the paradox that performance improves as
time allowed for PAS decay increases. Crowder's (1971b) solution is to posit an
active "read-out" or rehearsal process at the articulatory level. Time for a
covert run through the list is "...a second or two" (p. 339). If a suffix occurs
during this period, PAS for the last couple of items is spoiled before they are
reached; if no suffix occurs, the subject has time to check his rehearsal of
later items against his auditory store, and so to confirm or correct his
preliminary decision. Crowder (1971b) goes on to show that there is, in fact, no
evidence for any decay in PAS: in the absence of further input, PAS has an
infinite duration. This is intuitively implausible, but we will not pursue the
matter here.
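
The timing argument that opens this paragraph can be checked with a few lines
of Python. This is a sketch under stated assumptions: the function name, the
half-second pause before recall begins, and strictly regular presentation and
recall rates are all introduced here for illustration.

```python
def retention_intervals(n_items=8, rate_per_sec=2.0, recall_delay_sec=0.5):
    """Time from presentation to recall for each item in an ordered list.

    Assumes items are presented and recalled at the same steady rate, and
    that recall starts recall_delay_sec after the last item is presented
    (both assumptions made for this illustration).
    """
    step = 1.0 / rate_per_sec
    presented = [i * step for i in range(n_items)]
    recall_start = presented[-1] + recall_delay_sec
    recalled = [recall_start + i * step for i in range(n_items)]
    return [r - p for p, r in zip(presented, recalled)]


intervals = retention_intervals()
print(intervals)
```

Every item waits the same time between presentation and recall, so passive
decay alone predicts no advantage for the last items; the recency effect has to
come from somewhere else, such as the absence of following items.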

Notice, however, that the term precategorical refers to the nature of the
information stored, not to the period of time during which it is stored. A
preliminary (or even final) articulatory, if not phonetic, decision must have
been made before PAS is lost, if rehearsal is to permit cross-check with the
store. We are thus reminded of the temporary auditory store hypothesized in the
analysis-by-synthesis model of Stevens (1960, 1972a). Crowder's account, with
its preliminary analysis and generative rehearsal loop, is so similar to
Stevens' model that we may be tempted to identify the two, and to see evidence
for PAS function as support for Stevens' hypothesis.

We may remark, however, one important difference. Stevens introduced a
synthesis loop to handle the invariance problem, a problem at its most acute for
stop consonants. But these are precisely the items excluded from PAS, and all
our evidence for consonantal auditory memory suggests a store considerably less
than infinite, probably less than a second. We may, of course, assume that a
synthesis loop goes into operation very early in the process, while consonant
auditory information is still available, and that the PAS rehearsal loop is
simply a sustention beyond the point at which stop consonantal auditory
information can be accessed. We would then be forced to posit the decay of
consonantal information from auditory store. Continuation of the loop might be
automatic during running speech, enabling prosodic pattern to emerge, but under
attentional control for special purposes, such as listening to poetry and
remembering telephone numbers. But we have, at present, no direct evidence for
the earlier stage of the loop.

Stages of Processing 

Nor, as we have seen, do we have direct evidence for the primary auditory
store inferred from parallel processing. We may, in fact, do well to dismiss
division of the process into hypothetical stores, and concentrate attention on
the types of information extracted during early processing, and their
interactions. Several experimental paradigms have already been applied.

Day and Wood (1972) and Wood (1974) have reported evidence for parallel
extraction of pitch (fundamental frequency) and spectral information bearing on
segmental classification. For the first experiment they synthesized two CV
syllables, /ba, da/, each at two pitches, and prepared two types of random test
order. In one, they varied a single dimension, either fundamental frequency or
phonetic class; in the other, they varied both dimensions independently. They
then called on subjects to identify, with a reaction-time button, either pitch
or phonetic class, each in its appropriate one-dimensional test and also in the
two-dimensional test. Reaction times were longer for both tasks on the
two-dimensional test than on the one-dimensional test, but the increase was
significantly greater on the phonetic task than on the pitch task: unpredictable
pitch differences interfered with phonetic decision more than the reverse. The
authors took this finding as evidence for separate nonlinguistic and linguistic
processes, the first mandatory, the second optional. In a follow-up experiment,
Wood (1974) substituted a two-dimensional test in which fundamental frequency
and phonetic class variations were correlated rather than independent. Reaction
times were now significantly shorter for both tasks on the two-dimensional test:
subjects drew on both pitch and phonetic information for either pitch or
phonetic classification. Wood (1974) concludes that the two types of information
are separately and simultaneously extracted [as required, incidentally, by
Stevens' (1960) model].

There is more to these experiments. The phonetic task called for a decision
on the consonant (/ba/ vs /da/), but pitch information was primarily carried by
the vowel. In fact, had fundamental frequency differences been carried solely
by initial formant transitions, it is doubtful whether they would have
interacted with phonetic decision. Dorman (1974) has shown that listeners are
unable to discriminate intensity differences carried by the 50 msec initial
transitions of a voiced stop CV syllable, but are well able to discriminate
identical differences carried by isolated transitions, or by the first 50 msec
of a steady-state vowel. While the experiment has not been done, it seems likely
that Dorman's results would have held had he used fundamental frequency instead
of intensity. We would then be forced to conclude that, in Wood's (1973b)
experiment, subjects were using adventitious pitch information carried by the
vowel to facilitate judgment of the consonant, and vice versa. The experiments
thus reflect parallel processing, both of linguistic and nonlinguistic
information and of consonant and vowel.

Experimental separation of auditory and phonetic processes has also been
attempted in dichotic studies. Consider, for example, the following series.
Shankweiler and Studdert-Kennedy (1967; also Studdert-Kennedy and Shankweiler,
1970) found that listeners were significantly better at identifying the
consonants of dichotically competing CV or CVC syllables if the consonants
shared a phonetic feature than if they did not. Since the effect was present
both for pairs sharing vowel (e.g., /bi, di/, /du, tu/, etc.) and for pairs not
sharing vowel (e.g., /bi, du/, /di, tu/, etc.), and since the latter pairs
differ markedly in the auditory patterns by which the shared features are
conveyed, Studdert-Kennedy, Shankweiler, and Pisoni (1972) concluded that the
effect had a phonetic rather than an auditory basis. In another experimental
paradigm, Studdert-Kennedy et al. (1970b) presented CV syllables at various
values of SOA and demonstrated dichotic backward masking. They attributed the
masking to interruption of central processes of speech perception, but left the
level at which the interruption occurred uncertain (cf. Kirstein, 1971, 1973;
Porter, 1971; Berlin, Lowe-Bell, Cullen, Thompson, and Loovis, 1973; Darwin,
1971a).

Recently, Pisoni and McNabb (1974) have combined and elaborated the two
paradigms in a dichotic feature-sharing study, varying both masks and SOA.
Their targets were /ba, pa, da, ta/; their masks were /ga, ka, gæ, kæ, gɛ, kɛ/.
If target and mask consonants shared voicing, little or no masking was observed.
If they did not share voicing, masking of the target consonant increased both as
the masking-syllable vowel approached the target-syllable vowel, from /ɛ/
through /æ/ to /a/, and as the mask intensity increased. In other words,
identification of the target consonant was facilitated by similarity of the
masking consonant, but, in the absence of facilitation, was impeded by
similarity of the masking vowel, particularly if the vowel was of relatively
high intensity. In a theoretical discussion of these results, Pisoni (in press)
concludes that masking and facilitation occur at different stages of the
perceptual process: masking reflects integration (rather than interruption) at
the auditory level, while facilitation reflects integration at the phonetic
level.

However, these results are also open to a purely auditory interpretation.
They seem, in fact, to be consistent with a system that extracts the acoustic
correlates of voice onset time separately for each vowel context (cf. Cooper,
1974b). We are thus led to consider the possible role of discrete acoustic
feature analyzing systems, tuned to speech. This has proved among the most
fruitful approaches to analysis of early processing, but we defer discussion to
a later section (see below, Feature-Analyzing Systems).

The Acoustic Syllable 

We have now touched on some half dozen paradigms (categorical perception,
backward masking, short-term memory, reaction time studies, and others) in which
consonant and vowel perception differ. As a final example, we may mention
dichotic experiments [Berlin (in Lass, in press)]. Shankweiler and
Studdert-Kennedy (1967; also Studdert-Kennedy and Shankweiler, 1970) showed a
significant right-ear advantage for dichotically presented CV or CVC syllables
differing in their initial or final consonants, but little for steady-state
vowels or CVC syllables differing in their vowels. Day and Vigorito (1973) and
Cutting (in press-b) reported a hierarchy of ear advantages in dichotic
listening from a right-ear advantage for stop consonants through liquids to a
null or small left-ear advantage for vowels. Recently, Weiss and House (1973)
have demonstrated that a right-ear advantage emerges for vowels, if they are
presented at suitably unfavorable signal-to-noise ratios, while Godfrey (1974)
has shown that the right-ear advantage for vowels may be increased by adding
noise, reducing duration, or using a more confusable set of vowels (cf. Darwin
and Baddeley, 1974).

The pattern is familiar. In virtually every instance, a consonant-vowel
difference can be reduced or eliminated by taxing the listener's auditory access
to the vowel, or by sensitizing his auditory access to the consonant. These
qualifications only serve to emphasize the contrast between them, and to
pinpoint its source in their acoustic structure. The consonant is transient, low
in energy, and spectrally diffuse; the vowel is relatively stable, high in
energy, and spectrally compact.

Together they form the syllable, each fulfilling within it some necessary
function. Consider, first, vowel duration. Long duration is not necessary for
recognition. We can identify a vowel quite accurately and very rapidly from
little more than one or two glottal pulses, lasting 10 to 20 msec. Yet in
running speech, vowels last ten to twenty times as long. The increased length
may be segmentally redundant, but it permits the speaker to display other useful
information: variations in fundamental frequency, duration, and intensity within
and across vowels offer possible contrasts in stress and intonation, and
increase the potential phonetic range (as in tone languages). Of course, these
gains also reduce the rate at which segmental information can be transferred,
increase the duration of auditory store, and open the vowel to contextual
effects (the more so, the larger the phonetic repertoire). A language built on
vowels, like a language of cries, would be limited and cumbersome.

Adding consonantal "attack" to the vowel inserts a segment of acoustic
contrast between the vowels, reduces vowel context effects, and increases
phonetic range. The attack, itself part of the vowel [the two produced by "...a
single ballistic movement" (Stetson, 1952:4)], is brief, and so increases the
rate of information transfer. Despite its brevity, the attack has a pattern
arrayed in time, and the full duration of its trajectory into the vowel is
required to display the pattern. To compute its phonetic identity, time is
needed, and this is provided by the segmentally redundant vowel. Vowels are the
rests between consonants.

Finally, rapid consonantal gestures cannot carry the melody and dynamics of
the voice. The segmental and suprasegmental loads are therefore divided over
consonant and vowel: the first, with its poor auditory store, taking the bulk
of the segmental load; the second taking the suprasegmental load. There emerges
the syllable, a symbiosis of consonant and vowel, a structure shaped by the
articulatory and auditory capacities of its user, fitted to, defining, and
making possible linguistic and paralinguistic communication.

SPECIALIZED NEURAL PROCESSES 

Cerebral Lateralization 

That the left cerebral hemisphere is, in most persons, specialized for
language functions is among the most firmly established findings of modern
neurology. That one of those functions may be to decode the peculiar acoustic
structure of the syllable into its phonetic components was first suggested by
the results of dichotic studies. Kimura (1961a, 1961b, 1967) discovered that if
different digit triads were presented simultaneously to opposite ears, those
presented to the right ear were more accurately recalled than those presented to
the left. She attributed the effect to functional prepotency of contralateral
pathways under dichotic competition, and to left-hemisphere specialization for
language functions. Later experiments have amply supported her interpretation.

Shankweiler and Studdert-Kennedy (1967) applied the technique to analysis
of speech perception. They demonstrated a significant right-ear advantage for
single pairs of nonsense syllables differing only in initial or final stop
consonant, and separable advantages for place of articulation and voicing
(Studdert-Kennedy and Shankweiler, 1970; cf. Halwes, 1969; Darwin, 1969;
Haggard, 1971). Among the questions raised by these studies was whether the left
hemisphere was specialized only for phonetic analysis, or also for extraction of
speech-related acoustic properties, such as voice onset, formant structure,
temporal relations among portions of the signal, and so on. We will not rehearse
the argument here, but simply state the conclusion that "while the auditory
system common to both hemispheres is equipped to extract the auditory parameters
of a speech signal, the dominant hemisphere may be specialized for the
extraction of linguistic features from those parameters" (Studdert-Kennedy and
Shankweiler, 1970:594).

Striking evidence in support of this conclusion has recently been gathered
by Wood (1975) and Wood, Goff, and Day (1971). This work deserves careful
study, as an exemplary instance of the use of electroencephalography (EEG) in
the study of language-related neurophysiological processes. Wood synthesized
two CV syllables, /ba/ and /ga/, each at two fundamental frequencies, 104 Hz
(low) and 140 Hz (high). From these syllables he constructed two types of
random test order: in one, items differed only in pitch [e.g., /ba/ (low) vs
/ba/ (high)]; in the other, they differed only in phonetic class [e.g., /ba/
(low) vs /ga/ (low)]. Subjects were asked to identify either the pitch or the
phonetic class of the test items with reaction-time buttons. While they did so,
evoked potentials were recorded from a temporal and a central location over each
hemisphere. Records from each location were averaged and compared for the two
types of test. Notice that both tests contained an identical item [e.g., /ba/
(low)], identified on the same button by the same finger. Since cross-test
comparisons were made only between EEG records for identical items, the only
possible source of differences in the records was in the task being performed,
auditory (pitch) or phonetic. Results showed highly significant differences
between records for the two tasks at both left-hemisphere locations, but at
neither of the right-hemisphere locations. A control experiment, in which the
"phonetic" task was to identify isolated initial formant transitions (50 msec),
revealed no significant differences at either location over either hemisphere.
Since these transitions carry all acoustic information by which the full
syllables are phonetically distinguished, and yet are not recognizable as
speech, we may conclude that the original left-hemisphere differences arose
during phonetic, rather than auditory, analysis. We will discuss the adequacy of
isolated formant transitions as control patterns in the next section. However,
the entire set of experiments
strongly suggests that different neural processes go on during phonetic, as 
opposed to auditory, perception in the left hemisphere, but not in the right 
hemisphere (cf. Mblfese, 1972). 

The distinctive processes of speech perception would seem then to lie in
linguistic rather than acoustic analysis. Two other types of evidence suggest
the same conclusion. First, visual studies have repeatedly shown a right-field
(left-hemisphere) advantage for tachistoscopically presented letters and, by
contrast, a left-field (right-hemisphere) advantage for nonlinguistic geometric
forms (for a review, see Kimura and Durnford, 1974). Second, Papcun, Krashen,
Terbeek, Remington, and Harshman (1974) and Krashen (1972) have shown a
right-ear advantage in experienced Morse code operators for dichotically
presented Morse code words and letters. If the arbitrary patterns of both a
visual and an auditory alphabet can engage left-hemisphere mechanisms, there
might seem to be little ground for claiming special status for the speech
signal.

However, alphabets are secondary, and while their interpretation may well
engage specialized linguistic mechanisms, analysis of their arbitrary signal
patterns clearly should not. The speech signal, on the other hand, is primary,
its acoustic pattern at once the natural realization of phonological system and
the necessary source of phonetic percept. Given its special status and peculiar
structure, we should perhaps be surprised less if there were, than if there
were not, specialized mechanisms adapted to its auditory analysis.

Hints of such processes have begun to appear. Halperin, Nachshon, and Carmon
(1973), for example, showed a shift from left-ear advantage to right-ear
advantage for dichotically presented tone sequences as a function of the number
of alternations in the sequence. Their stimuli were patterned permutations of
brief (200 msec) tone bursts, presumably not unlike those of Papcun et al.
(1974), who showed a right-ear advantage in naive subjects for Morse code
patterns up to seven units in length. Both studies suggest left-hemisphere
specialization for assessing the sort of temporal relations important in
speech. Both studies suffer from having called upon subjects to label the
patterns, a process that might well invoke left-hemisphere mechanisms.

This weakness is avoided in recent work by Cutting (in press-b). He synthesized
two normal CV syllables, /ba/ and /da/, and two phonetically impossible
"syllables" identical with the former except that their first formant
transitions fell rather than rose along the frequency scale, so that they were
not recognized as speech. In a nonlabeling dichotic task, subjects gave equal
right-ear advantages for both types of stimulus. The outcome suggests a
left-hemisphere mechanism for extraction of formant transitions and is
reminiscent of a study by Darwin (1971b), who found a right-ear advantage for
synthetic fricatives when formant transitions from fricative noise into vowel
were included, but no ear advantage when transitions were excluded.

There are, then, grounds for believing that the left hemisphere is specialized
not only for phonetic interpretation of an auditory input, but also for
extraction of auditory information from the acoustic signal. The evidence is
tenuous, but systematic study of feature-analyzing systems (whether lateralized
or not remains to be seen; cf. Ades, 1974a) has opened up a new range of
possibilities.

Feature-Analyzing Systems

Neurophysiological systems of feature detectors, selectively responsive to
light patterns, were first reported by Lettvin, Maturana, McCulloch, and Pitts
(1959). They found receptive fields in the visual ganglion cells of frog that
responded, under specific conditions, to movement. The biological utility of
the system to an animal that preys on flies is obvious. Moving up the nervous
system, and the evolutionary scale, Hubel and Wiesel (1962) reported yet more
complex detectors: single cells in the visual cortex of cat that responded
selectively to the orientation of lines, to edges, and to movement in a certain
direction. Since then, work on visual feature-detecting systems has
proliferated (see Julesz, 1971:58-68, for a review).

Complex auditory feature detectors in the cortex of cat were reported by Evans
and Whitfield (1964): single cells responsive to specific gradients of
intensity change, and others ("miaow" cells) to the rate and direction of
frequency change (Whitfield and Evans, 1965). Similar cells were reported by
Nelson, Erulkar, and Bryan (1966) in the inferior colliculus of cat. Other
research has borne directly on acoustic signaling systems. Frishkopf and
Goldstein (1963) and Capranica (1965) reported single units in the auditory
nerve of bullfrog responsive only to the male bullfrog's mating call. Recently,
Wollberg and Newman (1972) have described single cells in the auditory cortex
of squirrel monkey which answer to that species' "isolation peep." Stimulus and
response were isomorphic: presentation of the "peep" with portions gated out
yielded a response in which corresponding portions were absent. Furthermore,
the remaining portions were no longer normal: if a central portion of the
signal was missing, the response pattern to the final portion changed.
Interaction of this kind is particularly interesting in light of the
contextually variant cues of speech, for which interpretation may demand
details of a complete pattern, such as the syllable.

The relevance of all this to speech has not gone unnoticed. The possible role
of feature-detecting systems in speech perception was scouted briefly by
Liberman et al. (1967), by Studdert-Kennedy (1974), and, at considerable
length, by Abbs and Sussman (1971). However, advance awaited a telling
experimental procedure. This was found in "adaptation" studies, a method with a
long history in visual research (Woodworth and Schlosberg, 1954). The paradigm
is simple enough. For example, after prolonged fixation of a line curved from
the median plane, a vertical line, presented as a test stimulus, appears curved
in the opposite direction: there is a "figural after-effect" in which portions
of the image are displaced (Köhler and Wallach, 1944). Related effects in color
and tilt also occur. While none of these effects is understood in any detail,
they are frequently interpreted in terms of specific receptors or of
feature-analyzing systems. Prolonged stimulation "fatigues" or "adapts" one
system, and relatively "sensitizes" a physically adjacent or related (perhaps
opponent) system. On this interpretation, to demonstrate perceptual shifts upon
prolonged exposure to a particular physical (or psychological) "feature" is to
demonstrate the presence of analyzing systems for that feature, and its
relative.

The method was first used by Warren and Gregory [1958; see also Warren, 1968
(in Lass, in press); Perl, 1970; Clegg, 1971; Lass and Golden, 1971; Lass and
Gasperini, 1973; Lass, West, and Taft, 1973; Obusek and Warren, 1973], yielding
an effect that they termed "verbal transformation." Subjects listen to a
meaningful word played repeatedly once or twice per second for several minutes,
and are asked to report any changes in the word that they hear. They report a
large number of transformations, usually meaningful words and not always
closely related phonetically to the original. However, Goldstein and Lackner
(in press) refined the method by using nonsense syllables (CV, V, VC) to reduce
semantic influence and by presenting them monaurally. They analyzed transforms
phonetically, and showed that each was confined to a single phone, usually on
one or two distinctive features (as defined by Chomsky and Halle, 1968), and
was independent of its syllabic context. Furthermore, the right ear gave
significantly more transforms than the left ear on consonants, but not on
vowels, and the transforms followed the phonological constraints of English.
These last two points are among the arguments that the authors present for
suspecting that the effects result from adaptation of phonetic, rather than
auditory, analyzing systems.

In a further refinement, Lackner and Goldstein (in press) used a natural CV
syllable, repeated monaurally 36 times in 30 sec, and a final test item
presented to either the same or the opposite ear. Both adapting and test items
were drawn from the set of six English stop consonants, followed by the same
vowel (either /i/ or /e/). Subjects reported the last adapting item and the
test item. Transforms in the test item occurred on both cross-ear (30 percent)
and same-ear (40 percent) trials. They were significantly more likely to occur
if the final adapting item was also transformed, and to be on the same
feature(s) (place and/or voice) as the adapting item transform, a result that
again hints at phonetic feature-detecting systems. The authors conclude from
the cross-ear trials that adaptation is central, rather than peripheral, but,
unable in this study to distinguish phonetic effects from the acoustic effects
that underlie them, they withhold judgment on whether the transforms are
auditory or phonetic.

This last is, of course, the crucial question. It can be approached only by use
of synthetic speech in which acoustic features can be specified precisely and,
within limits, manipulated independently of phonetic category. Eimas, working
independently of the previous authors, took this step in a series of
experiments growing out of his work on infants (discussed below), and has
concluded that the effect is phonetic. We will consider his work in some detail
because it introduced a fruitful paradigm that has already been put to good use
by others.

In the first experiment (Eimas and Corbit, 1973), the authors used two
voiced-voiceless series synthesized along the VOT continuum, one from /ba/ to
/pa/, the other from /da/ to /ta/ (Lisker and Abramson, 1964). On the
assumption of two voicing detectors, each differentially sensitive to VOT
values that lie clearly within its phonetic category, and both equally
sensitive to a VOT value at the phonetic boundary, the authors reasoned that
adaptation with an acoustically extreme token of one phonetic type should
desensitize its detector, and relatively sensitize (a metaphor, not an
hypothesis) its opponent detector, to boundary values of VOT, with a resulting
displacement of the identification function toward the adapting stimulus. They,
therefore, collected unadapted and adapted functions for both labial and
alveolar series. The adapting stimuli were drawn from the extremes of both
series, and their effects were tested within and across series. Figure 5 shows
the results for one of their three subjects (the experiment is taxing and
prohibits large samples, but the other two subjects gave similar functions).
The predicted results obtain. Furthermore, the effect is only slightly weaker
across series than within. (This result was replicated in an experiment,
briefly reported in their next paper, for which they used eight subjects to
demonstrate boundary shifts on alveolar and velar stop consonant VOT continua
after adaptation with labial stops.) In a supporting experiment the authors
showed that, following adaptation, the peak in an ABX discrimination function
is neatly shifted to coincide with the adapted phonetic boundary (cf. Cooper,
1974a).

Figure 5: Percentages of voiced identification responses ([b or d]) obtained
with and without adaptation, for a single subject. The functions for the [b,p]
series are on the left and those for the [d,t] series are on the right. The
solid lines indicate the unadapted identification functions; the dotted and
dashed lines indicate the identification functions after adaptation. The
phonetic symbols indicate the adapting stimulus. [From Eimas and Corbit (1973)
with permission of authors and publishers.]

In a second study (Eimas, Cooper, and Corbit, 1973), the authors report three
experiments. The first demonstrates that the site of the adaptation effect is
probably central rather than peripheral: it obtains as strongly when the
adapting stimulus is presented to one ear and the test stimulus to the other,
as when both are presented binaurally (cf. Ades, 1974a). The second
demonstrates that the effect is not obtained if the adapting stimulus is simply
the first 50 msec of the syllable /da/, an acoustic pattern that contains all
the voicing information, but is not heard as speech. The third experiment
assesses the relative strengths of the two hypothesized detectors, finding
that, as in the first study (see Figure 5), voiced stops tend to be more
resistant to adaptation (yield smaller boundary shifts) than voiceless. The
result encourages the hypothesis of separate detectors for each phonetic value
along an acoustic continuum, a notion with obvious relevance to categorical
perception. Additional support comes from the work of Cooper (1974a), who found
evidence of three distinct detectors along a /b-d-g/ continuum: adaptation with
/b/ shifted only the /b-d/ boundary, adaptation with /g/ shifted only the /d-g/
boundary, adaptation with /d/ shifted both neighboring boundaries.
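
The opponent-detector reasoning behind these adaptation experiments can be made
concrete with a small numerical sketch. Everything in it is an illustrative
assumption rather than anything taken from the studies cited: the Gaussian
tuning curves, the detector centers on an arbitrary 0-1 continuum, the common
tuning width, the idea that adaptation simply lowers one detector's gain, and
the function names. Under these assumptions the boundary between two
neighboring categories falls where their gain-weighted responses are equal, so
fatiguing one detector moves only the boundaries it shares, and moves them
toward the adapting category.

```python
import math

# Hypothetical detector centers on an arbitrary 0-1 acoustic continuum.
CENTERS = {"b": 0.0, "d": 0.5, "g": 1.0}
WIDTH = 0.25  # assumed tuning width (standard deviation), same for all detectors


def boundary(left, right, gains):
    """Continuum value at which two neighboring detectors respond equally.

    Each detector's response is gain * exp(-(x - center)**2 / (2 * WIDTH**2)).
    Setting the two responses equal and solving for x gives the formula below.
    """
    c1, c2 = CENTERS[left], CENTERS[right]
    shift = WIDTH ** 2 * math.log(gains[left] / gains[right]) / (c2 - c1)
    return (c1 + c2) / 2 + shift


def adapt(adaptor=None, factor=0.6):
    """Return detector gains after 'fatiguing' one detector (assumed gain loss)."""
    gains = {name: 1.0 for name in CENTERS}
    if adaptor is not None:
        gains[adaptor] *= factor
    return gains


if __name__ == "__main__":
    # Print both category boundaries with no adaptor and with each adaptor.
    for adaptor in (None, "b", "d", "g"):
        gains = adapt(adaptor)
        print(adaptor,
              round(boundary("b", "d", gains), 3),
              round(boundary("d", "g", gains), 3))
```

With these made-up values, adapting "b" moves only the b-d boundary, adapting
"g" moves only the d-g boundary, and adapting "d" moves both, each toward the
adaptor. The sketch reproduces that pattern only because it was built to do so;
it is silent on whether the detectors are auditory or phonetic.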

Let us remark first the striking achievement of these studies. Whatever the
underlying mechanism, Eimas and his colleagues have demonstrated in a novel,
direct, and peculiarly convincing manner the operation of some form of
feature-analyzing system in speech perception. The outcome was not foregone.
There might, after all, have been no adaptation effect at all. Alternatively,
the effect might have been on the whole syllable or on the unanalyzed phonemic
segment. But these possibilities were ruled out by the cross-series results.
The effect proved to be on a feature within the phonemic segment, and so has
provided the strongest evidence to date of a physiologically grounded feature
system (cf. Cooper and Blumstein, 1974).

What now is the evidence for phonetic rather than auditory adaptation? First,
the cross-series effect: phonetic tokens drawn from labial, alveolar, or velar
VOT continua differ acoustically in the extent and direction of their second
and third formant transitions, yet they are mutually effective adaptors. If the
effect were acoustic, the argument runs, the acoustic differences should
eliminate the effect. Note, however, that the differences were in acoustic cues
to place of articulation, while the feature being tested was voice onset time.
The cues to this feature are complex and, as we have seen, relational.
Furthermore, Cooper (1974b) has recently shown that VOT adaptation may be
vowel-specific: simultaneous adaptation with [da] and [tʰi] produced opposite
shifts on [ba-pʰa] and [bi-pʰi] series. Nonetheless, if outputs from such
detectors funneled into acoustic analyzers, tuned to presence or absence of
energy in the region of the first formant at syllable onset, we would expect
precisely the results that were obtained (cf. Stevens and Klatt, 1974).

The second piece of evidence is the failure of the truncated /da/, not heard as
speech, to "sensitize" the supposed /ta/ detector. Here the main problem is the
status of the truncated /da/ as a control (cf. Wood, 1975). There are two
possible types of design that may throw light on the auditory-phonetic issue.
In one, control and test items are acoustically identical (on dimensions
relevant to the phonetic dimension under test), but phonetically distinct; in
the other, they are acoustically distinct, but phonetically identical. The
first design, chosen by Eimas and his colleagues, may yield ambiguous results.
If adaptation with the control item shifts the phonetic boundary, we have
evidence for the existence of auditory detectors tuned to acoustic features of
speech. Precisely this outcome has, in fact, been reported by Ades (1973),
using the first 38 msec of the extreme test stimuli to shift the /bæ-dæ/
boundary. If, on the other hand, the control item does not shift the boundary,
the outcome is ambiguous. It may mean, as Eimas and his colleagues concluded,
that the hypothetical detector is phonetic. But it may also mean that an
acoustic detector tuned to features of speech is only adapted if stimulated by
a complete (i.e., phonetically identifiable) signal (cf. Wollberg and Newman,
1972). It is not, after all, implausible to suppose that the human cortex
contains sets of acoustic detectors tuned to speech and capable of mutual
inhibition. Each detector may respond to a particular acoustic property, but
may be inhibited from output to the phonetic system in the absence of a
collateral response in other detectors. The auditory system would then be
immune to adaptation by an incomplete signal.

The second type of design calls for control and test items that are
acoustically distinct (on dimensions relevant to the phonetic dimension under
study), but phonetically identical. This design rests, of course, on the fact
that the speech signal may carry several acoustic cues, each a more or less
effective determinant of a particular phonetic percept. The procedure is then
to synthesize two acoustic continua, manipulating in each a different acoustic
cue to the same phonetic distinction. If now the two series are mutually
effective in shifting the phonetic boundaries of the other, we have some
preliminary support for the hypothetical phonetic detector. This was the
outcome of studies by Ades (1974b), Bailey (1973), and Cooper (1974a), all of
whom demonstrated cross-series adaptations for /b-d/ continua with different
vowels. The use of different vowels meant that formant transitions cueing a
given phonetic type could be falling in one token (e.g., /dæ/), rising in
another (e.g., /de/). Thus, adaptation of simple acoustic detectors responsive
only to rising or only to falling formants (cf. Whitfield and Evans, 1965) was
ruled out. Of course, a more complex "acoustic invariance," derived from some
weighted ratio of F2 and F3 transitions, might be posited (Cooper, 1974a). But
the conclusion that the detectors are phonetic was tempting enough for both
Ades and Cooper to draw. Ades qualified his conclusion because, in a previous
experiment (Ades, 1974b), he had found no cross-series adaptation of CV and VC
continua (/bæ-dæ/, /æb-æd/): the phonetic detector, unlike phonetic listeners
and phonological theory, evidently distinguishes between initial and final
allophones. A funnel into a second level of phonetic analysis, possibly the
point of contact with an abstract generative system, would be needed to account
for the listener's inability to make this distinction.

For Bailey (1973), the phonetic conclusion was less convincing. He pointed to
spectral overlap in the transitions of his two series, and suggested an
acoustic system involving "...some generalizing balanced detectors of positive
and negative transitions" (p. 31) (cf. Cooper, 1974a). To test for the effect
of spectral overlap, he constructed two /ba-da/ series, one with a fixed F2 and
all place cues in F3, the other with no F3 and all place cues in F2. This, by
far the most stringent version of the phonetically identical-acoustically
distinct design, yielded cross-adaptation from the F2 cues series to the fixed
F2 series, but none from the F3 cues series to the no F3 series. This argues
strongly for auditory adaptation, and Bailey concluded that the system contains
"...central feature extractors which process the phonetically relevant
descriptors of spectral patterns" (p. 34).

Clearly, the issue of auditory versus phonetic detectors is not resolved. But
let us consider implications of each possible resolution for speech perception
theory and research. First, if discrete auditory detectors are being isolated
by the adaptation technique, we may be in a position to begin more precise
definition of the acoustic correlates of distinctive feature systems,
ultimately essential if phonological theory is to be given a physical and
physiological base. To the extent that this proved possible, we could be
isolating invariants in the speech signal, thus aligning speech perception with
that of other "natural categories," such as those of color and form (Rosch,
1973). But it is not inevitable that acoustic features be invariant correlates
of phonetic features: both the work of Ades (1974b) on initial and final stop
consonants and the work of Cooper (1974b) on vowel-specific VOT analyzers
suggest that invariance may lie at some remove from the signal. And, in either
event, to isolate acoustic features is not to define them phonetically, nor to
explain how they are gathered from syllables of the signal into phonemes, each
with its peculiar, nonarbitrary name: the auditory to phonetic transformation
would remain obscure.

If, on the other hand, the adaptation technique isolates discrete phonetic 
detectors, its unequivocal achievement will have been to undergird the psycho- 
logical and physiological reality of features in speech perception. Salutary 
though this may be for those of little faith, the outcome would be disappointing 
for research. For again, the process by which these features are drawn from the 
acoustic display and granted phonetic dimension will be hidden. To analysis of 
the analyzer a new technique must then be brought. 

Finally, we should not discount the possibility that the auditory-phonetic
distinction is misleading in this context, and that the adapted systems are
both auditory and phonetic. Indeed, recent work (Cooper, in press-a, in
press-b) suggests that each system can be adapted selectively, yet is
intimately related to the other. The closeness of the relation is revealed by
Cooper's (1974c) extension of the adaptation technique to the study of
relations between perceptual and motor aspects of speech. He has shown that
adaptation on a [bi-pi] continuum yields not only shifts in the perceptual
boundary, but correlated shifts in subjects' characteristic VOT values in
production. If his findings are replicable, we have here clear evidence for the
frequently hypothesized link between perception and production, and one that
may supersede the auditory-phonetic distinctions we have been attempting to
establish for these adaptation studies. To the origin of this link in the
processes of language acquisition we turn in the final section.

From Acoustic Feature to Phonetic Percept 

As we have seen, template-matching models of speech perception are not in good
standing. Faced with gross acoustic variations as a function of phonetic
context, rate, stress, and individual speaker, theorists have had recourse to
motor, or analysis-by-synthesis, accounts of speech perception: they have
sought invariance in the articulatory control system. Nonetheless, there are
grounds for believing that some form of template-matching may operate in both
speaking and listening, and there are more fundamental grounds than lack of
acoustic invariance for positing a link between production and perception.

Consider the infant learning to speak. Several writers (e.g., Stevens, 1973;
Mattingly, 1973) have pointed out that the infant must be equipped with some
mechanism by which it plucks from the stream of speech just those acoustic cues
that convey the phonetic distinctions it will eventually learn to perceive and
articulate. This fact motivates, in part, Stevens' (1973) pursuit of acoustic
invariants and his hypothesized property detectors. Evidence for the existence
of such detectors comes from the work of Eimas and his colleagues (Eimas,
Siqueland, Jusczyk, and Vigorito, 1971; Eimas, 1974; for a review, see Cutting
and Eimas, in press). They have investigated the capacity of infants as young
as one month to discriminate synthetic speech sounds. We will not describe
their method in detail, but broadly, it employs operant conditioning, a
synthetic speech continuum, an adapting stimulus, and a test item. The results
are reliable and striking: infants discriminate between pairs of stimuli drawn
from different adult phonetic categories, but not between pairs drawn from the
same phonetic category. The effect has been repeatedly demonstrated on both
voicing and place of articulation continua (cf. Moffitt, 1971; Morse, 1972).
Furthermore, the effect is absent for truncated control syllables, not heard by
adults as speech, exactly as in the adult adaptation studies. Eimas and his
colleagues interpret the effect as evidence for the operation of phonetic
feature detectors, presumably innate. Unfortunately, the outcome is ambiguous
for the same reasons as is the adult outcome: there is no way of assuring that
the adapted detectors are phonetic rather than auditory (see Cutting and Eimas,
1974, for further discussion of this point). The more cautious, and perhaps
more plausible, view is that they are auditory (cf. Stevens and Klatt,
1974:657-658).

We are then faced with two questions. First, do the acoustic features extracted
by such detector systems bear an invariant relation to phonetic features? This
is an empirical question and we will say no more here than that, given the
inconstancy of the speech signal, it is unlikely that they do. Second, and more
importantly, how does the infant "know" that the extracted properties are
speech? This, of course, is simply another version of the question: how are we
to define the phonetic percept? But, asked in this form, an answer immediately
suggests itself: the infant learns that sounds are speech by discovering that
it can make them with its own vocal apparatus.

Before elaborating this point, let us consider the work of Marler (1970, in
press). He has proposed a general model of the evolution of vocal learning,
based on studies of the ontogenesis of male "song" in certain sparrows (see
also Marler and Mundinger, 1971). Briefly, the hypothesis is that development
of motor song-pattern is guided by sensory feedback matched to modifiable,
innate auditory templates (cf. Mattingly, 1972). Marler describes three classes
of birds. The first (for example, the dove or the chicken) needs to hear
neither an external model nor its own voice for song to emerge: crowing and
cooing develop normally, if the birds are reared in isolation and even if they
are deafened shortly after birth. The second (for example, the song sparrow)
needs no external model, but does need to hear its own voice: if reared in
isolation, song develops normally, unless the bird is deafened in early life,
in which case song is highly abnormal and insect-like.

An example of the third class of bird is the white-crowned sparrow, which needs
both an external model and the sound of its own voice. Reared in isolation, the
white-crown develops an abnormal song with "...certain natural characteristics,
particularly the sustained pure tones which are one basic element in the
natural song" (Marler and Mundinger, 1971:429). If the bird is deafened in
early life, even this rudimentary song does not develop. There emerges instead
a highly abnormal song "...rather like that of a deafened song
sparrow...perhaps the basic output of the syringeal apparatus with a passive
flow of air through it" (Marler, in press). However, reared in isolation, but
exposed to recordings of normal male song during a critical period (10-50 days
after birth), the male (and the female, if injected with male hormone) develops
normal song some 50 or more days after exposure. Exposure to the songs of other
species will not serve, and deafening either before or after exposure to
conspecific song prevents normal development [Konishi (1965), cited by Marler,
in press].

Marler (in press) proposes that the rudimentary song of the undeafened,
isolated white-crown reflects the existence of an auditory template, "...lying
in the auditory pathway, embodying information about the structure of vocal
sounds." The template matches certain features of normal song, and serves to
guide development of the rudimentary song, as well as to "...focus...attention
on an appropriate class of external models." Exposure to these models modifies
and enriches the template, which then serves to guide normal development,
through subsong and plastic song, as the bird gradually discovers the motor
controls needed to match its output with the modified template. [Several
studies have reported evidence for the "tuning" by experience of visual
detecting systems in cat (Hirsch and Spinelli, 1970; Blakemore and Cooper,
1970; Pettigrew and Freeman, 1973) and man (Annis and Frost, in press), and of
auditory detecting systems in rhesus monkey (Miller, Sutton, Pfingst, Ryan, and
Beaton, 1972).]
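
The idea that motor output is shaped by auditory feedback matched against a
template can be illustrated with a toy search procedure. This is only a sketch
under loudly stated assumptions, not Marler's model: the two-parameter "vocal
apparatus" (produce), the squared-error measure (mismatch), the trial-and-error
update rule, and all numerical values are hypothetical. The learner tries small
random changes to its motor settings and keeps a change only if the sound it
hears itself produce comes closer to the template; a learner that cannot hear
its own output has no basis for keeping any change.

```python
import random


def produce(motor):
    """Hypothetical vocal apparatus: two motor settings -> two acoustic features."""
    a, b = motor
    return (2.0 * a + b, a - b)


def mismatch(sound, template):
    """Squared distance between a produced sound and the auditory template."""
    return sum((s - t) ** 2 for s, t in zip(sound, template))


def develop_song(template, can_hear, trials=2000, seed=0):
    """Trial-and-error motor learning guided by auditory feedback.

    Returns the final motor settings and their mismatch with the template.
    Without feedback (can_hear=False) no candidate is ever adopted.
    """
    rng = random.Random(seed)
    motor = (0.0, 0.0)  # assumed starting settings
    for _ in range(trials):
        # Propose a small random variation of the current motor settings.
        candidate = tuple(m + rng.gauss(0.0, 0.1) for m in motor)
        if can_hear and (mismatch(produce(candidate), template)
                         < mismatch(produce(motor), template)):
            motor = candidate  # keep only changes that sound closer
    return motor, mismatch(produce(motor), template)


if __name__ == "__main__":
    template = (3.0, 0.0)  # hypothetical target features
    print("hearing: ", develop_song(template, can_hear=True)[1])
    print("deafened:", develop_song(template, can_hear=False)[1])
```

In this sketch the hearing learner ends far closer to the template than the
deafened one, which never leaves its starting settings. A fuller version might
let exposure to external models alter the template itself, as the account above
proposes, but that step is not modeled here.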

Marler (in press) draws the analogy with language learning. He suggests 
that sensory control of ontogenetic motor development may have been the evolu- 
tionary change that made possible an elaborate communicative system as pivot of 
avian and human social organization. He argues that "new sensory mechanisms for 
processing speech sounds, applied first, in infancy, to analyzing sounds of 
others, and somewhat later in life to analysis of the child's own sounds, was a 
significant step toward achieving the strategy of speech development of Homo
sapiens." On the motor side, he points out, vocal development must have become
dependent on auditory feedback, and there must have developed "neural circuitry 
necessary to modify patterns of motor outflow so that sounds generated can be 
matched to preestablished auditory templates." 

Certainly, human and avian parallels are striking. Deafened at birth, the 
human infant does not learn to speak: babbling begins normally, but dies away 
around the sixth month (Mavilya, 1972). Whether this is because the infant has
been deprived of the sound of its own voice, of an external model, or of both, we 
do not know. But there does seem to be an (ill-defined) critical period during 
which exposure to speech is a necessary condition of normal development 
(Lenneberg, 1967; but see Fromkin, Krashen, Curtiss, Rigler, and Rigler, 1974).
And the work of Eimas and his colleagues has demonstrated the sensitivity of the
infant to functionally important acoustic features of the speech signal. At
least one of these features, the short VOT lag associated with stops in many
languages (Lisker and Abramson, 1964), is known to be among the first to appear
in infant babble (Kewley-Port and Preston, 1974). Finally, Sussman (1971) and
his colleagues (Sussman, MacNeilage, and Lumbley, 1974; Sussman and MacNeilage, 
in press) have reported evidence for a speech-related auditory sensorimotor 
mechanism that may serve to modify patterns of motor outflow, so as to match 
sounds generated by the vocal mechanism against some standard. In short,
Marler's account is consistent with a good deal of our limited knowledge of
speech development. Its virtue is to emphasize sensorimotor interaction and to
accord the infant a mechanism for discovering auditory-articulatory
correspondences.

Paradoxically, if we are to draw on this account of motor development for
insight into perceptual development, we must place more emphasis on the
relatively rich articulatory patterns revealed in early infant babble. The infant
is not born without articulatory potential. In fact, the work of Lieberman and
his colleagues would suggest quite specific capacities (Lieberman, 1968, 1972,
1973; Lieberman and Crelin, 1971; Lieberman, Harris, Wolff, and Russell, 1971;
Lieberman, Crelin, and Klatt, 1972). They have developed systematic evidence
for evolution of the human vocal tract from a form with a relatively high
larynx, opening almost directly into the oral cavity, capable of producing a
limited set of schwa-like vowel sounds, to a form with a lowered larynx, a large
pharyngeal cavity, and a right-angle bend in the supralaryngeal vocal tract,
capable of producing the full array of human vowels. Lieberman (1973) argues
that this development, taken with many other factors, including the capacity to
encode and decode syllables, paved the way for development of language.
Associated with changes in morphology must have come neurological changes to
permit increasingly fine motor control of breathing and articulation, including,
in all likelihood, cerebral lateralization (cf. Lenneberg, 1967; Geschwind and
Levitsky, 1968; Nottebohm, 1971, 1972). The outcome of these developments would
have been a range of articulatory possibilities as determinate in their form as
the patterns of manual praxis that gave rise to toolmaking. The inchoate forms
of these patterns might then emerge in infant babble under the control of
rudimentary articulatory templates.

In short, we hypothesize that the infant is born with both auditory and
articulatory templates. Each embodies capacities that may be modified by, and
deployed in, the particular language to which the infant is exposed. Presumably,
these templates evolved more or less pari passu and are matched, in some sense,
as key to lock. But they differ in their degree of specificity. For effective
function in language acquisition the auditory template must be tuned to specific
acoustic properties of speech. The articulatory template, on the other hand, is
more abstract, a range of gestural control, potentially isomorphic with the
segmented feature matrix of the language by which it is modified (cf. Chomsky and
Halle, 1968:294).

Among the grounds for this statement are the results of several studies of
adult speech production. Lindblom and Sundberg (1971), for example, found that,
if subjects were thwarted in their habitual articulatory gestures by the
presence of a bite block between their front teeth, they were nonetheless able to
approximate normal vowel quality, even within the first pitch period of the
utterance. Bell-Berti (1975) has shown that the pattern of electromyographic
potentials associated with pharyngeal enlargement during medial voiced stop
consonant closure varies from individual to individual and from time to time
within an individual. Finally, Ladefoged, DeClerk, Lindau, and Papçun (1972) have
demonstrated that different speakers of the same dialect may use different
patterns of tongue height and tongue root advancement to achieve phonetically
identical vowels. They do not report formant frequencies for their six speakers,
so that the degree of acoustic variability associated with the varied vocal-tract
shapes is not known. But since individuals obviously differ in the precise
dimensions of their vocal tracts, it would be surprising if they accomplished a
particular gesture and a particular acoustic pattern by precisely the same
pattern of muscular action. In short, it seems likely that both infant and adult
articulatory templates are control systems for a range of functionally
equivalent vocal tract shapes rather than for specific patterns of muscular
action. In fact, it is precisely to exploration of its own vocal tract and to
discovery of its own patterns of muscular action that the infant's motor
learning must be directed.

We should emphasize that neither template can fulfill its communicative
function in the absence of the other. Modified and enriched by experience, the
auditory template may provide a "description" of the acoustic properties of the
signal, but the description can be no different in principle than that provided
by any other form of spectral analysis: alone, the output of auditory analysis
is void. Similarly, babble without auditory feedback has no meaning. The
infant discovers phonetic "meaning" (and linguistic function) by discovering
auditory-articulatory correspondences, that is, by discovering the commands
required by its own vocal tract to match the output of its auditory template.
Since the
articulatory template is relatively abstract, the infant will begin to discover 
these correspondences before it has acquired the detailed motor skills of artic- 
ulation: perceptual skill will precede motor skill. In rare instances of per- 
ipheral articulatory pathology the infant (like the female white-crowned sparrow 
who learns the song without singing) may even discover language without speaking 
(cf. MacNeilage, Rootes, and Chase, 1967). 

We hypothesize then that the infant is born with two distinct capacities,
and that its task is to establish their links. Auditory feedback from its own
vocalizations serves to modify the articulatory template, to guide motor
development, and to establish the links. The process endows the communicatively
empty outputs of auditory analysis and articulatory gesture with communicative
significance. In due course the system serves to segment the acoustic signal
and perhaps, as analysis-by-synthesis models propose, to resolve acoustic
variability. But its prior and more fundamental function is to establish the
"natural categories" of speech. To perceive these categories is to trace the
sound patterns of speech to their articulatory source and recover the commands
from which they arose. The phonetic percept is then the correlate of these
commands.

Speech Recognition Through Spectrogram Matching*

Frances Ingemann+ and Paul Mermelstein
Haskins Laboratories, New Haven, Conn.



In order to assess human analysis of acoustic data before 
attempting such analysis by machine, we conducted a series of experi- 
ments in which subjects were asked to match spectrograms of continu- 
ous speech to reference spectrograms of the same words. Although 
error rates varied with sentence difficulty and size of vocabulary, 
comparison of the matches shows greater agreement in phoneme segments 
than other experiments have obtained in phonetic transcriptions of 
unknown utterances without semantic or syntactic processing. Accuracy
in matching can be further improved by feedback in the form of
spectrographic representation of a sequence of tentative matches
spoken as if they made up the unknown utterance. Automatic matching 
of word- or syllable-sized acoustic patterns may provide a more 
accurate phonemic input to the syntactic-semantic component of a 
speech recognition system than other methods so far attempted. 

The limited performance of speech recognition systems to date indicates to 
us that improved acoustic analysis as well as good semantic-syntactic analysis 
are prerequisites to better system performance. Human analysis of acoustic data 
without the use of nonacoustic information can be expected to assist the design
of improved acoustic analysis systems.

The difficulty that people have in accurately identifying the phonetic
content of spectrograms of unknown utterances has long been recognized by
researchers in the field (see, for example, Liberman, Cooper, Shankweiler, and
Studdert-Kennedy, 1968; also Fant, 1962). Until recently little experimentation
had been undertaken since the early pioneering work at Bell Laboratories
(Potter, Kopp, and Green, 1947). Within the past few years, interest in
spectrogram reading has been renewed, at least partially in response to attempts
at automatic speech recognition, in the expectation that cues available to human
spectrogram readers could be programmed into an automatic speech recognition
system.

*A shorter version of this paper will be published in the Journal of the Acoust- 
ical Society of America. 

+Visiting from University of Kansas, Lawrence.

Studies by Klatt and Stevens (1973), Lindblom and Svensson (1973), and
Svensson (1974) have shown that subjects who are experienced in examining
spectrograms can label phonetic segments correctly less than half the time when
they are presented with spectrograms about which they have no additional
information. These experiments have also shown that the addition of syntactic,
semantic, and prosodic information can improve the performance significantly.

Our interest lay in finding out whether speech recognition could be
improved without recourse to nonacoustic information. The technique we proposed
was the matching of spectrograms of unknown utterances with reference
spectrograms identified only by number so that success of the task depended
almost entirely on the ability to match patterns visually. Klatt and Stevens
(1973) also
used spectrographic matching, but because the reference words were known, syn- 
tactic and semantic considerations entered into the selection of suitable 
matches. Our experiments were undertaken to evaluate human spectrogram-matching 
performance before attempting spectrogram matching by machine. 

EXPERIMENT I 

The first experiment was in the nature of a limited pilot study to deter- 
mine whether subjects could match spectrograms at all. In this experiment, as 
in all the experiments described in this paper, spectrograms were based on the
speech of a single female speaker (one of the authors), who recorded the samples 
in a sound-treated room using a clear (but not exaggerated or over-precise) read- 
ing style. Wide-band spectrograms were produced on a Voiceprint spectrograph 
using a frequency scale of 0-4800 Hz. 

Subjects were asked to locate within spectrograms of five test sentences ten
words given in reference spectrograms (see Table 1). The reference words were
content words consisting of one, two, or four syllables spoken in the context
"Say ___ again." Each reference word occurred once in the sentences, except
that two monosyllabic words (grow and month) occurred in a suffixed form as well
as in the uninflected form given as the reference word. Subjects were not told
the meaning of either the reference words or the test sentences, but they were
given the meaning of the reference frame.



TABLE 1: Sentences used in Experiment I.

1. Little children often chew bubble gum.

2. My friend's grandfather used to grow tomatoes.

3. He has too much month left at the end of his money.

4. There is a growing interest in Victorian houses.

5. Emergency regulations will be in effect for six months.

Underlined words were to be matched to reference spectrograms
containing the same words.

Three subjects were used: one who had extensive experience examining
spectrograms, one who had moderate experience, and one who had no experience. All
three subjects performed the task with few errors (75-83 percent correct). The
subjects found particularly disconcerting the fact that they were told that two
reference words occurred twice but they didn't know which. Six of the eight
errors in matching are related to this aspect of the task. Table 2 lists the
identifications made by each subject.



TABLE 2: Tabulation of responses on Experiment I.

                        Responses

Reference word        S1    S2    S3

left                  -     -     E
grow         1        0     0     0
             2        -     -
month        1        -     -     -
             2        -     0     E
friend's              -     -     -
bubble                -     E     E
children              -     -     -
interest              -     -     -
regulations           -     -     -
Victorian             -     -     -
emergency             -     -     -

Key: - correctly located
     E incorrectly located
     0 not located



EXPERIMENT II 

Since Experiment I had shown that spectrograms could be matched, a second
experiment was devised to include all words in a randomly selected text to
determine whether the task could be done as successfully with a larger set of
reference words, some of which were unstressed.

The following passage, four sentences long, was chosen at random from a
publication:

When adults name things and persons for children they incidentally
transmit the texture or grain of their reality. They do this by
choosing for some referents names that categorize very broadly and,
for some referents, names that categorize very narrowly. That is
what this paper is about. It does not exhaust its subject if we
understand its subject to be the function of names in tuning one
consciousness to another (Brown, 1970).

These test sentences contain a total of 70 words, of which 51 are different. 
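
The word counts quoted for the passage can be checked mechanically. The sketch below is not part of the original study: it tokenizes the passage as quoted (with the citation left out), lowercases it, and counts running words and distinct words. The names `PASSAGE` and `tokenize` are illustrative choices, and treating words as case-insensitive is an assumption.

```python
import re

# The test passage as quoted in the text (the citation is left out).
PASSAGE = """
When adults name things and persons for children they incidentally
transmit the texture or grain of their reality. They do this by
choosing for some referents names that categorize very broadly and,
for some referents, names that categorize very narrowly. That is
what this paper is about. It does not exhaust its subject if we
understand its subject to be the function of names in tuning one
consciousness to another.
"""


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z]+", text.lower())


words = tokenize(PASSAGE)
running_words = len(words)        # every word, counting repeats
distinct_words = len(set(words))  # each different word counted once
print(running_words, distinct_words)
```

Under these assumptions the counts agree with the figures given in the text: 70 words in all, 51 of them different.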

Reference spectrograms were made of the 51 words in the context, "Say ___
again," in which again was given major sentence stress to prevent both
contrastive stress on the reference word and a possible phrase boundary juncture
between the reference word and again. In addition, a second version of some
monosyllabic function words in an unstressed context was provided. The second
context was not identified for the subjects, who were told only that the word
was in unstressed position in the second version.

TABLE 3: Responses to Experiment II.

[The body of this table, which listed the errors made by each subject against
the words of the test passage, is illegible in the source copy.]

Key: - indicates correct response
     0 indicates no response

Six subjects, all of whom had experience examining spectrograms, took part 
in the experiment. Each subject was given only one or two test sentences so 
that the reference set contained approximately twice as many words as a subject 
would find in his sentence. 

The overall score of correct identifications was 67 percent. Most errors
were made on monosyllabic words, particularly function words. A list of the
errors by subject is given in Table 3 and a summary of the results by word type
is shown in Table 4. Since some of the subjects took part in more than one
experiment, the subjects are identified by number for the set of experiments,
rather than separately for each experiment.



TABLE 4: Responses to Experiment II by word type.

                                      Total    Correct    Percent

Total Matches                          142        95        67%
Polysyllabic Matches                    53        47        89%
Monosyllabic Matches                    89        48        54%
Monosyllabic Content Word Matches       15        11        73%
Monosyllabic Function Word Matches      74        37        50%
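
The percentages in Table 4 follow directly from the two counts in each row. The sketch below recomputes them and checks that the subcategories add up to their parent rows. It assumes that the first number in each row is the number of matches attempted and the second the number correct; the variable and function names are illustrative.

```python
# (word type, matches attempted, matches correct), copied from Table 4.
# Reading the two columns as "attempted" and "correct" is an assumption.
TABLE_4 = [
    ("Total", 142, 95),
    ("Polysyllabic", 53, 47),
    ("Monosyllabic", 89, 48),
    ("Monosyllabic content", 15, 11),
    ("Monosyllabic function", 74, 37),
]


def percent_correct(attempted, correct):
    """Percentage of correct matches, rounded to the nearest whole number."""
    return round(100 * correct / attempted)


percentages = {name: percent_correct(n, k) for name, n, k in TABLE_4}
counts = {name: (n, k) for name, n, k in TABLE_4}

# Internal consistency: the subcategories should add up to their parents.
poly, mono = counts["Polysyllabic"], counts["Monosyllabic"]
content = counts["Monosyllabic content"]
function = counts["Monosyllabic function"]
totals_agree = (
    (poly[0] + mono[0], poly[1] + mono[1]) == counts["Total"]
    and (content[0] + function[0], content[1] + function[1]) == mono
)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
print("subtotals consistent:", totals_agree)
```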



EXPERIMENT III

Because monosyllabic words seemed to be more difficult to match than
polysyllabic words, a third experiment consisting only of monosyllabic words was
designed to examine this area more carefully. The difficulty of the task was
increased by adding to the reference words other words that were phonetically
similar.

The following test sentence consisting of ten monosyllables was used:

Ed will ask Ned to pay the bill for him.

The reference set consisted of the ten words in the test sentence plus the
following 30, making a total of 40 words:



do      win     met     bathe   Kate    shore
that    wool    men     bet     coop    hen
ill     lass    neck    dub     fin     as
add     last    beer    dwell   full    them
a       mill    bid     took    thumb   her

Once again, for some of the monosyllabic function words a second unstressed
variant in a context not known to the subject was provided.



The experiment used four subjects, all of whom had experience examining
spectrograms. Words were identified correctly only 48 percent of the time. The
responses of subjects are given in Table 5. In contrast to the previous
experiment, content words were not more easily matched than function words.
Content words were matched correctly only 45 percent of the time, while function
words were matched 50 percent of the time.

TABLE 5: Responses to Experiment III.

[The body of this table, headed "Responses by Subjects," is illegible in the
source copy.]

Key: - correct response
     0 no response

This experiment also pointed up the difficulty of locating word boundaries 
when the reference set includes words that can be confused. For example, ask
was identified twice as last and twice as lass because the l of the preceding
word will was assumed to be part of this word; to was once identified as took
when it preceded pay.

A comparison of the string of phonemes in the test sentence with the string
of phonemes in the matched reference words shows that the percent of phonemes
correctly matched is considerably higher than the percent of words. Of the
phonemes in the sentence, 72 percent were found to be correctly matched and
35 percent errors were made. The total of these exceeds 100 percent because two
phonemes in the reference words were sometimes matched to a single phoneme in 
the sentence. The comparison of phonemes is given in Table 6. 
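
One way to see how the correct and error percentages can sum to more than 100 is to score every reference phoneme against the sentence phoneme it was aligned with, while dividing by the number of phonemes in the sentence. The sketch below illustrates that bookkeeping. It is an assumed convention, not necessarily the scoring procedure the authors used, and the phoneme symbols and the alignment are hypothetical.

```python
def score_alignment(sentence, alignment):
    """Return (percent correct, percent error) relative to sentence length.

    sentence  -- list of phoneme symbols in the test sentence
    alignment -- list of (sentence_index, reference_phoneme) pairs; more than
                 one reference phoneme may point at the same sentence index
    """
    n = len(sentence)
    correct = sum(1 for i, ref in alignment if sentence[i] == ref)
    errors = len(alignment) - correct
    return 100 * correct / n, 100 * errors / n


# Hypothetical example: "to pay" matched as "took" + "pay", with the final
# k of "took" and the p of "pay" both assigned to the p of the sentence.
sentence = ["t", "u", "p", "e"]
alignment = [(0, "t"), (1, "u"), (2, "k"), (2, "p"), (3, "e")]

pct_correct, pct_error = score_alignment(sentence, alignment)
print(pct_correct, pct_error, pct_correct + pct_error)
```

In this toy case every sentence phoneme is matched correctly, yet the extra k still counts as an error, so the two percentages total 125.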

TABLE 6: Comparison by phonemes of matches in Experiment III.

[The body of this table, headed "Matches by Subjects" with columns S1, S2, S4,
and S8, is illegible in the source copy.]

Key: 0 no response

When considered from the point of view of word recognition relative to
phoneme recognition, the results correspond rather closely to the relationship
found by Fletcher (1929) between syllable recognition and 'letter' recognition
in testing noisy speech transmission systems. Fletcher's curves would predict
77 percent 'letter' [phoneme] recognition to accompany 48 percent syllable
recognition. The predicted sentence intelligibility for human listeners under
these conditions is 94 percent. These facts suggest that an automatic speech
recognition system with performance on acoustic analysis comparable to the
visual performance of our human subjects and with performance on the
syntactic-semantic level comparable to that of human listeners can be expected
to "understand" 94 percent of simple questions or instructions such as given in
Fletcher's Intelligibility List.
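
As a rough aid to intuition only, the link between phoneme-level and word-level accuracy can be approximated by assuming that each phoneme of a word is matched independently with the same probability. This independence model is a simplification introduced here; it is not Fletcher's empirical curves, and the function name `word_rate` and the choice of three phonemes per word are illustrative assumptions.

```python
def word_rate(phoneme_rate, n_phonemes):
    """Probability that all n phonemes of a word are right, if each phoneme
    is matched independently with probability phoneme_rate (an assumption)."""
    return phoneme_rate ** n_phonemes


# Phoneme rates mentioned in the text: 72 percent observed in Experiment III,
# 77 percent read from Fletcher's curves for 48 percent syllable recognition.
for p in (0.72, 0.77):
    print(f"phoneme rate {p:.2f} -> three-phoneme word rate {word_rate(p, 3):.2f}")
```

For a three-phoneme word the model gives a word rate of about 0.46 at a phoneme rate of 0.77, in the neighborhood of the 48 percent quoted above. The agreement should not be pressed further than that, since real phoneme errors are not independent and the test words vary in length.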

EXPERIMENT IV 

The number of errors in the preceding experiments led us to try
spectrographic feedback as a means of improving performance. Experiment IV
began, as did the previous two experiments, by asking the subject to match words
in a sentence to reference words. The sentence to be matched was a truncated
version of a sentence chosen at random from a paper:

The way a speaker produces a string of phones will show a good deal 
of variability.

Two subjects were used: one who had participated in all three of the pre- 
vious experiments and one who had participated in none. One subject (S2) was 
given 120 reference words, including all the words in Experiments II and III and
34 new ones. The other subject (S9) was given a subset of these, 78 in all,
which included all words in Experiment III, 4 additional words from Experiment II,
and the 34 new words. These words are listed in Table 7. 

After the subject had tentatively matched the test sentence, the reference 
words he selected were read as a sentence with stress and intonation as close as 
possible to the original utterance. Spectrograms of this sequence of tentative 
matches were given to the subject to compare with the original sentence. He was 
then allowed to revise his list of matches and once again he was given spectro- 
grams of the sequence of matches. This process was repeated until the subject 
indicated that he no longer wished to continue. Both subjects stopped with 
their third attempt. The subject was then asked to give a confidence rating for
each of his matches. The results are given in Table 8. 

The subjects differed greatly in their matching ability, although
spectrographic feedback improved both of their performances. Whereas S9 on the
third try correctly matched all the words, S2 attained only 38 percent correct.
Furthermore, only 50 percent of the matches in which S2 expressed high confidence
were in fact correct. However, the ratings did have some validity in that none 
of the low-confidence matches were correct. 

There are a number of possible explanations for the difference between the
two subjects' performances. S9 had slightly less than two-thirds of the
reference words that S2 had. In addition to having more opportunities to make an
error, S2 also had a greater problem in handling the data physically: S2 had to
sort through 146 reference spectrograms, 26 of the reference words being
represented by two spectrograms, one in the standard frame and one in unstressed
position.
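
The sizes of the two reference sets can be reconstructed from figures given elsewhere in the paper. The sketch below does the arithmetic. The five shared words are read off the published word lists (they occur both in the Experiment II passage and in the Experiment III reference set); the variable names are illustrative.

```python
EXP2_WORDS = 51          # different words in the Experiment II passage
EXP3_WORDS = 40          # reference set of Experiment III
NEW_WORDS = 34           # words added for Experiment IV
EXTRA_FROM_EXP2_FOR_S9 = 4
DOUBLED_WORDS = 26       # words given in both stressed and unstressed forms

# Words that occur in both the Experiment II and Experiment III sets,
# as read from the word lists printed in the paper.
SHARED = {"do", "that", "the", "to", "for"}

s2_words = EXP2_WORDS + EXP3_WORDS - len(SHARED) + NEW_WORDS
s9_words = EXP3_WORDS + EXTRA_FROM_EXP2_FOR_S9 + NEW_WORDS
s2_spectrograms = s2_words + DOUBLED_WORDS
s9_share = s9_words / s2_words

print(s2_words, s9_words, s2_spectrograms, round(s9_share, 2))
```

The result is 120 words and 146 spectrograms for S2 and 78 words for S9, which is 65 percent of S2's set and so slightly less than two-thirds.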

Another difference was that between trials 2 and 3, S9 requested and re- 
ceived spectrograms of a second reading of the original sentence so that he 
could get some indication of the variation to be expected in a rereading of the 
same sentence. S2 did not receive these spectrograms of the second reading. 

TABLE 7: Reference words used in Experiment IV.

[illegible]         dub                 men                 string
a*                  dwell               met                 (subject)
able                Ed                  mill                swell
(about)             eel                 (name)              tea
add                 (exhaust)           (names)             (texture)
(adults)            fall                (narrowly)          that*
(and*)              fin                 neck                the*
(another)           foam                Ned                 their*
as*                 for*                (not)               them*
ask                 (function)          of*                 (they*)
astray              good                (one)               (things)
away                (grain)             (or*)               (this)
bathe               hairy               owe                 thumb
(be*)               he*                 (paper)             to*
beer                hen                 pay                 (transmit)
bet                 her*                peak                took
bid                 him*                (persons)           (tuning)
bill                (if*)               phones              (understand)
bring               ill                 prod                variability
(broadly)           (in*)               produce             very
(by*)               (incidentally)      produces            wag
(categorize)        is*                 (reality)           way
(children)          (it*)               (referents)         (we*)
(choosing)          (its*)              says                welsh
(consciousness)     Kate                shore               (what)
coop                lass                show                (when)
could               last                shower              will*
deal                lit                 (some)              win
do*                 love                speaker             wool
(does*)             lucky               stream              wringer

*   Words followed by an asterisk were given in both stressed and
    unstressed forms.

( ) Words in parentheses were not given to S9.

A third difference was the amount of time spent on the task. Although no
accurate record was kept, S9 estimated that he had spent about twice the time
that S2 did. S9 reported that he used quite detailed coarticulation criteria
whereas S2 attempted to judge matches on the basis of general configuration with
variations as a function of stress and position of the segment within the word.

A fourth difference may lie in individual ability to perform the task. In
the previous three experiments, S2 was usually at the lower end of the range of
success. S9, on the other hand, was the only subject participating in the
experiments who succeeded in reading more than occasional words in the sentence
to be matched. He made only one mistake, identifying good as big. It should be
noted, however, that most subjects did not make a serious effort at reading the
spectrograms since that was given as a secondary instruction in Experiment I to
S1 and S2 only and as an option "if you have time" in Experiment II. Since the
few subjects who attempted to read spectrograms had not been very successful,
the instruction was omitted in Experiments III and IV.

Although S2's word recognition score was only 38 percent on this experiment,
a comparison of phonemes between the words he matched and the words in the orig- 
inal sentence gives a score of 82 percent correct and 24 percent error. When we 
compare this result with Experiment III, we see that although S2 made more word- 
matching errors than the average for Experiment III, he matched more phonemes. 

ADDITIONAL OBSERVATIONS 

Because one of our interests was in automatic speech recognition, we asked 
subjects to report informally on the procedures they used in matching. Most 
subjects categorized the spectrograms, with varying degrees of rigor, according
to gross phonetic features; they also noted, even if not always consciously, 
stressed and unstressed syllables. 

Some subjects began by categorizing the reference spectrograms according to 
length and whether they contained stop-like, s-like, or other fricative features, 
and whether the voiced segments contained readily identifiable rising, falling, 
or steady formant patterns. They then examined the test sentences seeking simi- 
lar gross features as a clue to determining which category of reference spectro- 
grams should be examined to find the closest match. 
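
The categorize-then-search strategy described above can be sketched as a coarse index: reference spectrograms are grouped by a few gross features, and a stretch of the test sentence is compared only against the group that shares its features. Everything in the sketch is hypothetical, including the feature labels assigned to the example words; none of it was measured from real spectrograms.

```python
from collections import defaultdict
from typing import NamedTuple


class GrossFeatures(NamedTuple):
    """Coarse visual description of a spectrogram (all labels hypothetical)."""
    length: str               # "short" or "long"
    has_stop: bool            # contains a stop-like gap or burst
    has_s_like: bool          # contains s-like high-frequency frication
    has_other_fricative: bool
    formant_pattern: str      # "rising", "falling", or "steady"


def build_index(references):
    """Group reference words by their gross feature description."""
    index = defaultdict(list)
    for word, features in references.items():
        index[features].append(word)
    return index


def candidates(index, observed):
    """Reference words whose gross features equal those observed."""
    return sorted(index.get(observed, []))


# Illustrative labels only; they were not derived from real spectrograms.
REFERENCES = {
    "bill": GrossFeatures("short", True, False, False, "steady"),
    "bid": GrossFeatures("short", True, False, False, "steady"),
    "mill": GrossFeatures("short", False, False, False, "steady"),
    "ask": GrossFeatures("short", True, True, False, "steady"),
    "pay": GrossFeatures("short", True, False, False, "rising"),
}

index = build_index(REFERENCES)
segment = GrossFeatures("short", True, False, False, "steady")
print(candidates(index, segment))
```

The index narrows the search to a handful of candidates, which then need the finer visual comparison of frequency, duration, and manner cues that the subjects describe.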

Some subjects began by searching among the reference words for something 
that could match a portion of the test sentences before or after a pause, since 
in that position at least one of the word boundaries was sure. After a match 
had been made, the adjacent portion was studied. Some subjects also scanned the 
test sentences to look for distinctive patterns of stops and fricatives and then 
searched for reference words that would fit. When several reference words had
the same gross phonetic characteristics, more detailed comparisons were made
involving frequency, duration, and manner cues that were not considered in the
first categorization. These comparisons were made visually without making de- 
tailed measurements. 

Most subjects eventually worked from the reference words, taking each in 
turn and visually scanning the test sentences to see if the reference word 
matched any portion of the sentence. This procedure was particularly effective 
for polysyllabic words, the patterns of which were easily recognized when em- 
bedded in sentences. However, this procedure would not be feasible for a large 
set of reference words. 
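
Scanning a sentence for each reference word in turn resembles sliding a template along a longer pattern and keeping the position where the two differ least. The sketch below shows that idea on toy data. It is a simplified stand-in, not the subjects' visual procedure or any method proposed by the authors: spectrograms are represented as lists of frames, frames as short lists of numbers, and no allowance is made for differences in speaking rate.

```python
def frame_distance(a, b):
    """Squared difference between two spectral frames of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_offset(sentence, reference):
    """Slide the reference over the sentence, frame by frame.

    Returns (start_frame, mean_frame_distance) for the best-fitting position.
    """
    n, m = len(sentence), len(reference)
    if m == 0 or m > n:
        raise ValueError("reference must be non-empty and no longer than sentence")
    best = None
    for start in range(n - m + 1):
        cost = sum(
            frame_distance(sentence[start + k], reference[k]) for k in range(m)
        ) / m
        if best is None or cost < best[1]:
            best = (start, cost)
    return best


# Toy data: five two-band frames for the sentence, two for the reference word.
sentence_frames = [[0, 0], [1, 0], [5, 5], [6, 4], [0, 1]]
reference_frames = [[5, 5], [6, 4]]

print(best_offset(sentence_frames, reference_frames))
```

The cost of this exhaustive scan grows with the number of reference words multiplied by the length of the sentence, which is one way to see why the procedure becomes impractical for a large vocabulary.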

Most subjects matched function words only after the prominent words were
located and short intervening portions remained unidentified. Many subjects had
difficulty distinguishing function words from transitional segments connecting
two words.

CONCLUSIONS 

Subjects can match spectrograms of unknown utterances with reference-word
spectrograms better than they can transcribe the spectrograms directly in terms
of a sequence of phonetic elements. Previous studies (Klatt and Stevens, 1973;
Lindblom and Svensson, 1973; Svensson, 1974) have shown that use of syntactic,
semantic, and prosodic information can materially enhance subjects' ability to
transcribe. Our results indicate that subjects do not even make full use of the
acoustic information present in the spectrograms unless they are given a means
to assess the significant differences between the spectrographic manifestations
of different words. One such means is comparison of spectrograms.

Even when the number of words correctly matched is low, the number of
syllables in the utterance is for the most part preserved. Furthermore, the number
of phonemes matched is higher than the number of correct identifications
reported for other acoustic phoneme-recognition schemes. This suggests that the
output of this analysis-by-matching technique could yield a more accurate input
to the syntactic-semantic component of a speech recognition system than is now
available.

A process that generates the sequence of matched words in a manner
resembling as closely as possible that of the unknown utterance serves as a useful
source of feedback to the subjects. Comparison of the newly produced form with
the original reveals differences in patterns which suggest that new choices
might be made at those points. However, since only two subjects took part in
the experiment with feedback, the degree of improvement that might be generally 
obtained cannot be predicted. 

Subjects' error rates in matching vary significantly with sentence
difficulty, size of vocabulary, and general ability to predict the changes a word
pattern may undergo when placed in an unknown context and spoken with a
different prosody. At least for limited vocabularies, subjects are able to
determine whether two spectrographic patterns do or do not correspond to
different productions of the same word more reliably than they are able to assign
phonetic labels to the speech stream as seen in the spectrogram. However, as the
size of the vocabulary increases, the likelihood of selecting the correct word
decreases. Since the analysis time and paper-handling difficulties also increase
with vocabulary, manual execution of such tasks can rapidly become impractical.
Subjects' ability to generalize the acoustic cues they observe so that they need
not be presented with all spectrographic forms but only with a limited subset
remains to be investigated.

Automation of this matching process would entail storage of a complete
vocabulary in spectrographic form, an exorbitant requirement. Various parametric
representations may be considered, but the corresponding storage savings will
have to be weighed against deterioration of performance as compared with full
spectrogram matching. Although generalization of the appropriate acoustic
information into a sufficient set of analysis rules remains the ultimate goal,
studies of word matching provide useful comparisons with the performance of any
other method.

We believe the results of these experiments warrant continuation of our
studies with computer-assisted word retrieval as a means of developing automatic
pattern-matching techniques that make best use of those cues found useful by
humans in establishing reliable word matches.

Results of a VCV Spectrogram-Reading Experiment

G. M. Kuhn and R. McI. McGuire 
Haskins Laboratories, New Haven, Conn.



ABSTRACT 

We attempted to identify the consonant in 432 spectrograms of
vowel-consonant-vowel (VCV) utterances. In five sessions of
spectrogram reading, our overall identification rate was 83 percent.
An error analysis of the results shows that:

1. During the course of the experiment, our identification 
rate improved from 75 to 90 percent. 

2. Voicing, manner, and place errors occurred on the 
following percentages of the tokens: 

Voicing 01% 
Manner 07% 
Place 16% 

3. The greatest improvement in identification rate came in 
stops and fricatives, the two manner classes that were 
the most numerous. 

We conclude that one can learn to do well at identifying conso- 
nants from spectrograms of utterances of this constrained phonetic 
type. In a further spectrogram-reading experiment we plan to prepare 
a checklist of the cues we have used to identify each consonant. We 
assume that the checklist will help us apply the cues more consis- 
tently, and that it will help indicate where further cues are needed. 

INTRODUCTION 

We wanted to see how well we could identify the consonant in spectrograms
of the V'CV type (where ' indicates the presence of prominent stress on the
following syllable). We were also interested in finding out whether our
performance would improve over time and what kinds of errors we would make. We
chose the V'CV frame for two reasons: first, it eliminated the significant
problems of locating and counting the consonants; and second, it is an
environment that gave us a clear picture of the cues for the prestressed
consonants.

The consonants we were interested in were the 24 phonemic consonants of
"General American," and three important allophonic variants (the glottal stop,
voiced h, and the apical flap). Our consonant inventory was then:

m n ŋ b p d t g k ʔ ǰ č ɾ v f ð θ z s ʒ ʃ ɦ h w l r y

We decided to embed each of the 27 consonants in the 16 environments formed by
the inclusive combination of /i a u ɚ/, yielding a total of 432 utterances. We
chose these vowels since they include extreme positions of the first three
formants of the phonemic vowels of General American. The coarticulation of the
same intervocalic consonant with these different vowels can produce significant
changes in the acoustic patterns associated with the consonant (Liberman,
Delattre, Cooper, and Gerstman, 1954; Ohman, 1966). Thus, though VCVs are
simple in segmental structure, we could not assume that it would be a simple
exercise to identify intervocalic consonants from spectrograms.

METHOD 

Orthographic equivalents were chosen for the phonetic symbols (see
Appendix 1). The use of orthographic equivalents permitted us to create and
randomize the names of the VCVs using a character string editor on the Haskins
Laboratories' DDP-224 computer. To avoid the possibility of identification by
elimination, the names were randomized only over the entire 432 positions. The
randomized names were output from the computer as a printer listing (see
Appendix 2). One of the experimenters read the listing of the randomization at
a rate of one VCV per sec. The recording of this reading was used to make
broad-band Voiceprint spectrograms of the frequency range 0-4.8 kHz. A linear
frequency scale (1.2 kHz/in) was selected.
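
The construction of the stimulus list can be sketched in a few lines of Python. This is only an illustration of the design described above (27 consonants in the 16 vowel environments, shuffled once over all 432 positions). The consonant placeholders, the name format, and the random seed are assumptions made for the sketch; it is not the character-string-editor procedure run on the DDP-224.

```python
import itertools
import random

# Hypothetical placeholder labels; the real orthographic equivalents are in Appendix 1.
CONSONANTS = [f"C{i:02d}" for i in range(1, 28)]   # 27 consonants
VOWELS = ["EE", "AH", "UW", "ER"]                  # the four vowels of the design


def build_vcv_names(consonants, vowels):
    """Place every consonant in every V1-V2 environment (stress mark before C)."""
    return [
        f"{v1}'{c}{v2}"
        for c in consonants
        for v1, v2 in itertools.product(vowels, repeat=2)
    ]


def randomize(names, seed=0):
    """Shuffle once over the entire list, not within smaller blocks."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    return shuffled


names = build_vcv_names(CONSONANTS, VOWELS)
listing = randomize(names)
print(len(listing))  # 27 consonants x 16 environments
```

Shuffling over the whole list, rather than within sets of 27, is what prevents a reader from narrowing down the remaining candidates by elimination.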

The spectrogram-reading sessions were held both morning and afternoon of
two consecutive days, and on the afternoon of the third day. We read 40, 120,
100, 100, and finally, 72 spectrograms in the successive sessions. Each session
proceeded as follows. The next 20 spectrograms were taken from the
randomization. Each experimenter attempted to identify the 20 consonants,
writing the name of the consonant and any comments and observations on his
answer sheet. When both experimenters had finished with the set of 20, they
compared answers and argued any differences, but did not change their answers.
Then the computer listing was checked and the intended consonants determined.
Errors were noted and rationalized. In each session, this process was repeated
set by set, until we felt too tired to continue. No session lasted more than
three hours.

After the experimental sessions, the audio recording was used in a control
session where the experimenter who spoke the randomization attempted to identify
the intended consonant by ear. The intended consonant was heard in all but
three cases. In the following analysis the data were treated as though all
consonants were heard as intended. Since the errant identifications are presented
in Appendix 3, the reader can proceed with any data analysis he wishes. For our
own part, we drew up identification matrices (see Appendix 4), and then we set
up a feature distance metric for all the consonants. The following table of the
consonants represents the voicing, manner, and place relationships as we,
somewhat arbitrarily, imposed them (compare with the scheme presented in
Peterson and Shoup, 1966).



TABLE 1

            BILABIAL  UNI-LABIAL  LINGUA-DENTAL  ALVEOLAR  PALATAL  VELAR  GLOTTAL
            + -       + -         + -            + -       + -      + -    + -

NASAL       m                                    n                  ŋ
STOP        b p                                  d t                g k      ʔ
AFFRICATE                                                  ǰ č
FLAP                                             ɾ
FRICATIVE             v f         ð θ            z s       ʒ ʃ
SONORANT    w                                    l         y        r

From left to right, place of articulation moves back down the vocal tract, with
voicing alternating within a place column. From top to bottom, the manner
classes might be said to show a lesser duration of occlusion. We do not pretend
that this is a perfect feature assignment, but only that it is one that could
reasonably be used to find out whether the size of our identification errors
decreased over time. In calculating feature distance we have assigned unit
value to a minimal distinction in any dimension, and we have summed the distance
along the three dimensions. Thus, if /m/ is identified as /r/, we have said
that the feature distance is 5 in place, plus 5 in manner, plus 0 in voicing,
for a total of 10. So defined, feature distance is the basis for the data in
Figure 2.
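
The metric just described can be written out as a short Python sketch. The place and manner orderings follow the columns and rows of Table 1; the small feature dictionary covers only a few consonants, and treating nasals and sonorants as voiced is an assumption made here so that the /m/ to /r/ example comes out with a voicing distance of 0.

```python
# Column (place) and row (manner) order as in Table 1.
PLACES = ["bilabial", "uni-labial", "lingua-dental", "alveolar",
          "palatal", "velar", "glottal"]
MANNERS = ["nasal", "stop", "affricate", "flap", "fricative", "sonorant"]

# (place, manner, voiced) for a few consonants read off Table 1.
# Voicing of the nasals and sonorants is an assumption of this sketch.
FEATURES = {
    "m": ("bilabial", "nasal", True),
    "n": ("alveolar", "nasal", True),
    "b": ("bilabial", "stop", True),
    "p": ("bilabial", "stop", False),
    "d": ("alveolar", "stop", True),
    "t": ("alveolar", "stop", False),
    "r": ("velar", "sonorant", True),
}


def feature_distance(intended, response):
    """Sum of unit steps in place, manner, and voicing between two consonants."""
    place_i, manner_i, voiced_i = FEATURES[intended]
    place_r, manner_r, voiced_r = FEATURES[response]
    place = abs(PLACES.index(place_i) - PLACES.index(place_r))
    manner = abs(MANNERS.index(manner_i) - MANNERS.index(manner_r))
    voicing = int(voiced_i != voiced_r)
    return place + manner + voicing


print(feature_distance("m", "r"))  # 5 in place + 5 in manner + 0 in voicing
```

A correct identification has distance 0, so averaging this value over all tokens gives the per-token measure plotted by session.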

RESULTS 

Q1. How well were the consonants identified?

Both experimenters averaged 83 percent in consonant identification over all 
sessions. Given the fact that the phonetic context was chosen to increase the 
probability of success, we feel that this is not a particularly high rate of 
identification. In addition, the feature distance of many of the errors was so 
large that the average feature distance per error was 3.4. 

Q2. Did the identification rate improve over time? 

The results appear more positive when we look at identification and feature
errors over time. Figure 1 shows that the 83 percent overall identification
score hides the fact that the identification rate improved from 75 to 90 percent.

Figure 1: Identification by session.



Figure 2 shows the average feature distance per token, an indicator of how wrong
each identification was on the average. Here the results averaged over both
experimenters show a continuing decrease from session to session. It appears that
the identification error rate fell by 60 percent, and that the feature distance
per token fell by 66 percent. This latter result suggests that the errors grew
not only less frequent, but also somewhat smaller.
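
The 60 percent figure follows from the identification rates quoted above. As a minimal check, assuming the first-session and last-session identification rates are the 75 and 90 percent reported for Figure 1:

```python
def relative_reduction(initial, final):
    """Fractional drop from an initial value to a final value."""
    return (initial - final) / initial


first_error = 100 - 75   # error rate (%) when 75 percent were identified
last_error = 100 - 90    # error rate (%) when 90 percent were identified

print(relative_reduction(first_error, last_error))  # (25 - 10) / 25
```

The same function applied to the feature distance per token would give the 66 percent figure, but the per-session distances are only available from Figure 2 and are not reproduced here.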



Figure 2: Average feature distance per token, by session.



Q3. What kinds of errors were made? 

Overall, voicing, manner, and place errors occurred on the following per- 
centages of the tokens: 

Voicing  <01%
Manner    07%
Place     16%
In short, voicing errors were practically nonexistent, while manner errors
occurred about half as often as errors of place. Figure 3 shows that manner and
place identification errors both decreased over time. This figure also shows
that the rate of manner errors dropped 66 percent to a final value of 5 percent.
The rate of place errors, on the other hand, dropped only 50 percent, to a final
value of 10 percent. Since the overall identification error rate and the place
identification error rate both dropped to 10 percent, it follows that all manner
errors in the last session were also errors of place. This result underscores
the fact that while we have so far talked about manner and place independently,
they are not truly independent. In other words, there are holes in the phonetic
pattern, and an error of manner or place can force one to make an error in the
other dimension.
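
The step from "both rates are 10 percent" to "all manner errors were also place errors" is an inclusion-exclusion argument, and can be sketched as follows. The function names are illustrative; the rates are those quoted in the paragraph above, expressed per 100 tokens.

```python
def initial_rate(final_rate, fractional_drop):
    """Rate before a given fractional drop, e.g. 10 after a 50% drop was 20."""
    return final_rate / (1.0 - fractional_drop)


def min_overlap(overall_errors, place_errors, manner_errors):
    """Fewest tokens that must carry both a place and a manner error.

    Every token with a place or manner error is a misidentified token, so
    |place or manner| <= overall, and |place and manner| is at least
    place + manner - overall.
    """
    return max(0, place_errors + manner_errors - overall_errors)


print(initial_rate(10, 0.50))   # implied first-session place error rate
print(initial_rate(5, 0.66))    # implied first-session manner error rate
print(min_overlap(10, 10, 5))   # last session, per 100 tokens
```

With 10 overall errors, 10 place errors, and 5 manner errors per 100 tokens, the minimum overlap equals the full count of manner errors, which is the conclusion drawn in the text.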




Figure 3: Manner and place identification errors, by session (curves for
manner, place, and overall errors).



Figure 4 shows identification error rates by manner class. Nasals were
incorrectly identified 32 percent of the time. Semivowels, stops, and
fricatives were missed about 18 percent of the time. Affricates and flaps were
missed less than 10 percent of the time. Figure 5 shows error rates for those
manner classes that were represented by at least ten identifications in each of
the five sessions. Nasals, stops, and fricatives show improvement, but
semivowels do not. At the end of the experiment, nasals and semivowels have the
highest error rates. The two experimenters do show some difference in their
ability to identify stops and semivowels, but the overall pattern of their
results is so similar that we feel it is reasonable to draw conclusions based on
their averaged data.




Figure 4: Identification errors by manner class (nasal, semivowel, stop,
fricative, affricate, flap).

Figure 5: Identification errors per manner class, by session.

CONCLUSION



We interpret these results to show that one can learn to do well at
identifying consonants from spectrograms of utterances of this phonetic type. Of
course it should be borne in mind that the VCV utterance is simple in structure,
and that we set ourselves a very restricted task.

The greatest improvement in identification rate came on stops and
fricatives, the two most numerous manner classes. As a consequence, we are
tempted to assume that the error rates on nasals and semivowels would fall below
their final value of 20 percent, if the number of tokens identified approached
that for the stops and fricatives.

Our impression is that the cues we used are documented in the literature:
the state of voicing, the shape of fricatives and bursts, the transitions of
formants, to name just a few. All these cues are by now classic. We may indeed
have made novel use of the cues, but since no explicit rules for identification
were followed during the experiment, we have not presented tables of cues
drawn up after the fact. Instead, we plan to test our cues on a second set of
the same 432 VCVs. There, we will attempt to find out whether these explicit
rules can help us to achieve more consistent results, and whether they can bring
the problems of place analysis into sharper focus.

REFERENCES 

Liberman, A. M., P. C. Delattre, F. S. Cooper, and L. J. Gerstman. (1954) The
     role of consonant-vowel transitions in the perception of the stop and nasal
     consonants. Psychol. Monogr. 68 (8, Whole No. 379).
Ohman, S. E. G. (1966) Coarticulation in VCV utterances: Spectrographic
     measurements. J. Acoust. Soc. Amer. 39, 151-168.
Peterson, G. E. and J. E. Shoup. (1966) A physiological theory of phonetics.
     J. Speech Hearing Res. 9, 5-67.






ORTHOGRAPHIC    PHONETIC
EQUIVALENT      SYMBOL

M               m
N               n
NG              ŋ
B               b
P               p
DD              d
T               t
G               g
K               k
•               ʔ
J               ǰ
CH              č
D               ɾ
V               v
F               f
DH              ð
TH              θ
ZH              ʒ
SH              ʃ
H               ɦ
HH              h
W               w
L               l
Y               y
R               r
EE              i
AH              a
UW              u
ER              ər
' (apostrophe)  ' (stress)

Appendix 1: Orthographic equivalents



Appendix 2: Printer listing of the randomized VCV names.



Appendix 3: Errant identifications.



Appendix 4: Identification matrices.



Evidence for Spectral Fusion in Dichotic Release from Upward Spread of Masking

Terrance M. Nearey and Andrea G. Levitt
Haskins Laboratories, New Haven, Conn.



INTRODUCTION 

Evidence from recent experiments conducted at Haskins Laboratories (Nye,
Nearey, and Rand, 1974; Rand, 1974) indicates at least two important points.
(1) The first formant (F1) of synthetic stimuli can mask higher formants (F2,
F3, and the fricative components). In particular, it has been shown that this
upward masking effect can result in a loss of information about the place of
articulation for stop consonants. (2) A release from masking on the order of
20 db can be obtained if the signal is spectrally divided and presented
dichotically, with F1 directed to one ear and the higher formants to the other
ear.

One difficulty with the previous studies is that, in principle, it was
possible for the listener to make the necessary discriminations by attending to
the ear receiving the upper formant information alone. The fusion of the inputs
to both ears and the perception of the combined sounds as speech were not
necessary to produce correct responses. Nevertheless, anecdotal comments from
the listeners appear to indicate that fusion normally took place. Further
support is available from previous reports that listeners perceive spectrally
divided signals as a single voice in a single location (Broadbent and Ladefoged,
1957; Cutting, 1973). Moreover, additional evidence (Carlson, Granström, and
Fant, 1970) suggests that when spectral fusion of vowel formants occurs, phonetic
information can be extracted from both dichotically presented channels.

Our present study sought evidence that fusion occurs in conjunction
with a release from masking. The basic technique was to create stimulus
conditions where contributions from both channels were required to make the
necessary phonetic judgments. Clear evidence for spectral fusion with a release
from masking was found for vowels in one experiment. In a second experiment, the
results suggest the operation of a similar fusion effect when listeners attempt
to discriminate place versus voicing cues for stop consonants.



Also University of Connecticut, Storrs.
Also Yale University, New Haven, Conn.

A similar information loss occurs in natural speech which has been low-pass
filtered or mixed with Gaussian noise (Miller and Nicely, 1955).

[HASKINS LABORATORIES: Status Report on Speech Research SR-39/40 (1974)]

METHODS



Experiment 1 

The essential strategy for testing a hypothesis of spectral fusion is to
provide stimulus sets with no redundancy between channels for selected phonetic
properties. In the first of our experiments, this was accomplished by the
construction of a triplet of three-formant vowels, [ɛ], [æ], and [a], with the
following properties. (1) The nominal amplitude values of all the corresponding
formants were the same, e.g., the amplitude of F1 of [ɛ] was equal to the
amplitude of F1 of [a]. In fact, small differences occur in the signals with
different formant frequencies; however, these differences of less than 1 db are
insignificant compared to the attenuation factors used in the experiment. (2) The
durations of the vowels were identical. (3) The formant frequencies of the
vowels were chosen so that [ɛ] and [æ] differed only in F1, while [æ] and [a]
differed only in F2. The frequency of F3 was the same for all three vowels.
The practical consequence of this stimulus choice is that information from both
F1 and F2 is required to keep all three vowels distinct. Loss of F1 implies the
loss of the [ɛ]/[æ] distinction, while loss of F2 implies the loss of the
[æ]/[a] distinction.
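
The logic of this vowel triplet can be made concrete with a small Python sketch. The formant frequencies below are invented for illustration (the report does not list the values in this passage); only the pattern of equalities matters, namely that the first two vowels share F2, the last two share F1, and all three share F3. The ASCII keys "eh", "ae", and "aa" stand for [ɛ], [æ], and [a].

```python
# Hypothetical formant frequencies in Hz; only the equalities are meaningful.
VOWELS = {
    "eh": {"F1": 550, "F2": 1800, "F3": 2500},   # [ɛ]
    "ae": {"F1": 750, "F2": 1800, "F3": 2500},   # [æ]: differs from [ɛ] in F1 only
    "aa": {"F1": 750, "F2": 1100, "F3": 2500},   # [a]: differs from [æ] in F2 only
}


def confusable_pairs(vowels, available):
    """Vowel pairs that are identical on every formant in `available`."""
    names = sorted(vowels)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if all(vowels[first][f] == vowels[second][f] for f in available):
                pairs.append((first, second))
    return pairs


print(confusable_pairs(VOWELS, ["F2", "F3"]))        # F1 lost
print(confusable_pairs(VOWELS, ["F1", "F3"]))        # F2 lost
print(confusable_pairs(VOWELS, ["F1", "F2", "F3"]))  # nothing lost
```

A listener attending to only one channel therefore cannot keep all three vowels apart, which is what makes correct three-way identification evidence of fusion.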

Three formant transition burst patterns appropriate for [b], [d], and [g]
were provided for each vowel stimulus, resulting in a final stimulus set of nine
consonant-vowel (CV) syllables. The consonant portions were adjusted
empirically for each of the nine stimuli with the restriction that vowels with
identical steady-state F1's were provided with identical F1 transitions. The
selection of F2,F3 values for the consonantal portions was restricted only by the
requirement that they reach the steady-state vowel values by the time the F1
frequency transition had ended (see Figure 1).

Two tapes were prepared for the experiment. Each tape contained eight
blocks of 27 randomly ordered stimuli (each of the nine stimuli appearing three
times in each block). One complete tape was used for each condition, and tapes
were alternated between conditions. There were six dichotic conditions, for a
total of 1296 trials (9 stimuli x 24 occurrences per tape x 2 ear/formant
conditions x 3 attenuation levels: 10, 20, and 30 db). In the two binaural
conditions there were 432 trials (9 stimuli x 24 occurrences per tape x 2
attenuation levels: 10 and 30 db). The tapes were played at a baseline level of
85 db.
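
The trial totals can be checked directly from the tape layout. The following Python sketch uses only the counts given in the paragraph above; the variable names are illustrative.

```python
STIMULI = 9
BLOCKS_PER_TAPE = 8
TRIALS_PER_BLOCK = 27          # each of the 9 stimuli three times per block

trials_per_tape = BLOCKS_PER_TAPE * TRIALS_PER_BLOCK
occurrences_per_tape = trials_per_tape // STIMULI    # times each stimulus appears

EAR_FORMANT_CONDITIONS = 2
DICHOTIC_ATTENUATIONS_DB = (10, 20, 30)
BINAURAL_ATTENUATIONS_DB = (10, 30)

dichotic_trials = (STIMULI * occurrences_per_tape
                   * EAR_FORMANT_CONDITIONS * len(DICHOTIC_ATTENUATIONS_DB))
binaural_trials = STIMULI * occurrences_per_tape * len(BINAURAL_ATTENUATIONS_DB)

print(occurrences_per_tape, dichotic_trials, binaural_trials)
```

Since one complete tape was used per condition, the six dichotic conditions correspond to six tape presentations and the two binaural conditions to two.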

Twelve subjects, none of whom had any known hearing loss, participated in
the experiment and were paid for their participation. They were divided into
four equal groups; each group took part in a 2 1/2 hour session which included a
20 minute break. Presentation of the six dichotic conditions and the two
binaural conditions was balanced across groups.

Initially, the subjects were told that they would be hearing nine CV
syllables, [bɛ], [dɛ], [gɛ], [bæ], [dæ], [gæ], [ba], [da], and [ga]. They were
then instructed to write down either a phonemic transcription of what they heard
or a corresponding English word, e.g., "bed" for [bɛ], "bad" for [bæ], etc. A
chart of word responses was available for reference during the experiment.
After a brief practice session the experiment began.

See Nye, Nearey, and Rand (1974) for a definition of the "baseline sound
pressure level."

Experiment 2 

For the second experiment a different set of nine stimuli was used: six
consonant stimuli and three pseudo-consonants followed by the vowel [a]. The
six consonants were the voiced stops [b], [d], and [g] and the voiceless stops
[p], [t], and [k], while the pseudo-consonants, or mixed stimuli, were half
voiced, half voiceless sounds that have no counterparts in natural speech. The
mixed stimuli were constructed from the F1 portion of the voiced stimuli and the
F2,F3 portions of the voiceless stimuli. The important consideration here is
that for all stimuli, place information was carried exclusively by F2 and F3,
while in the case of the mixed stimuli there were conflicting voicing cues in F2
and F3 (voiceless) versus F1 (voiced) (see Figure 2). All stimuli were produced
on the Haskins Laboratories' parallel formant resonance synthesizer.

Six tapes, each containing 72 stimuli, were produced to give a total of 432 
trials delivered at a baseline level of 85 db SPL: (9 stimuli x 2 repetitions x 
3 presentation conditions x 2 noise conditions x 4 attenuation levels). Each 
tape was designed for monaural, dichotic, or binaural presentation, in one case 
without noise and in the other case with the addition of a Gaussian noise signal 
(signal/noise ratio +6 db). The F2 and F3 components of all the stimuli were 
attenuated 0, 10, 20, or 30 db. Furthermore, all the stimuli were randomized 
and balanced for ear of presentation on both the monaural and dichotic tapes. 
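
The trial count quoted above can be checked by enumerating the factorial
design directly. The following is only a minimal sketch: the factor levels are
taken from the text, while the variable names and the condition labels are
illustrative assumptions, not anything used by the authors.

```python
from itertools import product

# Factors of the second experiment as described in the text.
stimuli = range(9)                         # 6 consonants + 3 mixed stimuli
repetitions = range(2)
presentations = ["monaural", "dichotic", "binaural"]
noise_conditions = ["no noise", "noise"]   # noise tapes: S/N ratio of +6 db
attenuations_db = [0, 10, 20, 30]          # attenuation applied to F2 and F3

# Every combination of the five factors is one trial.
trials = list(product(stimuli, repetitions, presentations,
                      noise_conditions, attenuations_db))

# One tape per presentation mode x noise condition.
n_tapes = len(presentations) * len(noise_conditions)
trials_per_tape = len(trials) // n_tapes

print(len(trials), n_tapes, trials_per_tape)
```

The three printed values correspond to the total number of trials, the number
of tapes, and the number of stimuli per tape given in the text.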

Nine students, none of whom had any known hearing loss, were paid for their 
participation in the experiment, which lasted one hour. The subjects were di- 
vided into three groups. Although all of the groups encountered the three tapes 
without noise first, the order of presentation of the monaural, dichotic, and
binaural tapes was balanced across groups. 

RESULTS 

The chief results of the first experiment are presented in Figure 3. In the
dichotic condition there are essentially no vowel errors at any attenuation
level. By contrast, in the binaural condition, the error rate for vowel
identification rises to 32.7 percent under a 30 db attenuation of F2 and F3.
An analysis of the errors reveals that this high rate is due almost entirely
to [æ]/[a] confusions, consistent with the masking of F2 and F3 by F1.
Identification of the vowel [ɛ] remains essentially unaffected; the phonetic
distinction between it and the other two vowels always involves an F1
difference. Even at the 10 db attenuation level there are significantly more
vowel errors in the binaural condition (p ≤ .01), and once again the errors
are heavily concentrated in [æ]/[a] confusions. It should be pointed out that
consonant errors are also significantly higher in the binaural 30 db
attenuation condition than in the corresponding dichotic condition, as
indicated by a paired difference test (p < .01). There is considerable
variability in the intelligibility of individual consonants, apparently due
largely to differences in the energy relationships between the consonantal
portions of individual tokens. One consonant, [g], performs extremely well in
all conditions. Its resistance to masking probably arises because in all three
syllables containing [g] there is a simulated burst of relatively high F2 and
F3 energy near the onset of each syllable, where the amplitude of F1 is
relatively low.

We thank Tim Rand, who constructed the stimuli and prepared the tapes used in
this experiment.

The concentration of vowel errors in [æ]/[a] confusions is consistent with the
findings of Ainsworth and Millar (1972). In experiments which varied vowel
formant amplitude levels, vowel identification was basically unaffected until
F2 amplitude reached a level of 28 db below F1. Beyond that level, they note
that vowel errors occurred between vowels with the same F1 values.

The results of the second experiment also show evidence of a release from
masking as well as fusion of the spectrally divided signal. The results of
this experiment were scored for number of correct responses for place and
voicing in the cases of the voiced and voiceless stimuli, and for number of
correct responses for place and proportion of responses "voiced" in the case
of the mixed stimuli.

The findings for the voiced CV syllables show a highly significant release
from masking for the dichotic no-noise condition when F2 and F3 are attenuated
30 db (p < .01) (see Figure 4). Overall performance is considerably lower for
the tapes with noise, however, and no significant release from masking was
found for dichotic presentation.

Under dichotic conditions an overwhelming proportion of the responses to the
mixed stimuli are reported as "voiced." Given the nature of these unnatural
stimuli and their randomized occurrence among pure voiced and pure voiceless
stimuli, the fact that place identification is significantly better than
random for the "voiced" responses provides some evidence that fusion is in
fact taking place, because place information can only be extracted from the
channel providing F2, F3 information, while voicing information can only be
obtained from the fully voiced F1 transition supplied to the opposite ear. It
should be pointed out, however, that the potency of aspiration of the higher
formants as a cue to voicelessness may not be very great, and no formal
control experiment with F2 and F3 presented alone was run to measure the
strength of this feature.

DISCUSSION 

The results of these two experiments confirm the evidence from previous
studies (Nye, Nearey, and Rand, 1974; Rand, 1974) that release from masking
occurs under the dichotic mode of presentation, in which F1 is sent to one ear
and the higher formants are sent to the other ear. Further evidence indicates
that fusion is in fact taking place under the dichotic presentation. This is
seen most clearly for vowels in the first experiment, in which fusion of the
higher formants with F1 is found to occur in the dichotic condition. The mixed
stimuli of the second experiment provide additional evidence for fusion, this
time in the case of consonants. Although the experimental design in the second
experiment did not incorporate a control condition which would have
conclusively demonstrated that F2, F3 of the mixed stimuli are strong
voiceless cues when heard alone, subsequent testing with the mixed stimuli has
shown that they are indeed heard as voiceless consonants. The extremely high
incidence of voiced responses to these stimuli in the dichotic condition thus
indicates subjective fusion of the voiced F1 component with the voiceless F2
and F3 components.

In the binaural conditions of these experiments voiced F1 clearly acts as a
strong mask on the attenuated higher formants, but the "cutback" F1 has no
similar effect. However, whether the F1 transition alone is masking the higher
formant transitions or whether there is backward masking of the steady-state
F1 on the higher formant transitions is not clear. In order to provide further
evidence about the type of masking that occurs, F1 can be temporally offset
with respect to the higher formants so that it either precedes or follows F2
and F3 at equal intervals. This procedure should help to determine whether the
F1 transition or the F1 steady-state portion of the vowel is the more
effective masker of the higher formants, and whether backward masking in
addition to simultaneous masking occurs in the binaural condition, and
possibly in the dichotic condition. Research is currently underway to seek
evidence of these possible masking effects.

REFERENCES 

Ainsworth, W. A. and J. B. Millar. (1972) The effect of relative formant
     amplitudes on the perceived identity of synthetic vowels. Lang. Speech
     15, 328-341.

Broadbent, D. E. and P. Ladefoged. (1957) On the fusion of sounds reaching
     different sense organs. J. Acoust. Soc. Amer. 29, 708-710.

Carlson, R., B. Granström, and G. Fant. (1970) Some studies concerning
     perception of isolated vowels. Quarterly Progress and Status Report
     (Speech Transmission Laboratory, Royal Institute of Technology,
     Stockholm, Sweden) STL-QPSR 2-3, 19-35.

Cutting, J. E. (1973) Phonological fusion of synthetic stimuli in dichotic and
     binaural presentation modes. Haskins Laboratories Status Report on Speech
     Research SR-34, 55-56.

Miller, G. A. and P. Nicely. (1955) An analysis of perceptual confusions among
     English consonants. J. Acoust. Soc. Amer. 27, 338-352.

Nye, P., T. Nearey, and T. Rand. (1974) Dichotic release from masking: Further
     results from studies with synthetic speech stimuli. Haskins Laboratories
     Status Report on Speech Research SR-37/38, 123-137.

Rand, T. (1974) Dichotic release from masking for speech. J. Acoust. Soc.
     Amer. 55, 678-680. [Also in Haskins Laboratories Status Report on Speech
     Research SR-33 (1973), 47-55.]






The Tones of Central Thai: Some Perceptual Experiments* 
Arthur S. Abramson

Haskins Laboratories, New Haven, Conn.



INTRODUCTION 

In recent years, research into prosodic features as part of the effort to
understand the nature of speech communication has become an increasingly
conspicuous aspect of experimental phonetics (Fry, 1968; Lieberman, 1974).
Within prosody, such features as phonologically distinctive stress, segment
length, and tone are clearly central to the sound pattern of a language in
that they differentiate linguistic expressions. These are usually of more
immediate interest to the linguist than are other types of prosodic features,
and it is usually easier to design experiments testing hypotheses concerning
them. Prosodic research of this kind should make contributions to general
phonology and, more narrowly, to our understanding of the phonology of a
particular language, in this instance Thai.

The research reported here forms part of a larger project; other aspects of
the project will be presented elsewhere. Taken together, these studies should 
furnish much information on the perceptually tolerable ranges of the tones of 
Thai. The work takes as its starting point the usual assumption that the major 
phonetic features of phonemic tone are found in the domain of pitch. The prima- 
ry acoustic correlate of pitch is, of course, frequency. In instrumental analy- 
ses of tonal features, then, we measure the fundamental frequency of the voice 
as determined by the repetition rate of glottal pulsing. 



*This is a slightly revised version of a chapter to be published in Studies in
Tai Linguistics, ed. by Jimmy G. Harris and James Chamberlain. (Bangkok:
Central Institute of English Language, in press).

Also University of Connecticut, Storrs.

This is not to assert that other realms of prosody, such as sentence
intonation and rhythm, are irrelevant to the enterprise of phonetic research.
Indeed, some important psycholinguistic questions have been raised in work on
intonation; see, e.g., Lieberman (1967) and Studdert-Kennedy and Hadding
(1973).

Acknowledgment: This research was conducted while the author was on sabbatical
leave in Thailand on research fellowships from the American Council of Learned
Societies and the Ford Foundation Southeast Asia Fellowship Program. I
gratefully acknowledge the hospitality of the Faculty of Humanities,
Ramkhamhaeng University, and the Central Institute of English Language, both
in Bangkok.

The language under analysis here is Central Thai (Siamese), the regional
dialect that serves as the official language of Thailand. All native speakers
used as subjects were from Bangkok or its close environs. The question of the
tonal homogeneity of the central area of the kingdom has not been well
explored, and tonal variation within the area thus cannot be categorically
ruled out as a perturbing factor in some of the data presented; nevertheless,
the speech of the test subjects gave an impression of general uniformity in
tonal behavior.

One aim of this study was to determine how well the five tones of the language
could be identified in isolation. It is at least conceivable that the
identifiability of one or more of the tones would suffer without the benefit
of an immediate context. Another aim was to reaffirm earlier work (Abramson,
1961, 1962) on the sufficiency of certain ideal fundamental frequency contours
for the identification of the tones, using synthetic speech in which it would
be possible to make frequency contours the only variable. In the expectation
that less-than-perfect identification would be achieved with fundamental
frequency alone, we also planned to learn whether the addition of variations
in the amplitude of the speech signal would enhance the identifiability of the
tones. Finally, we proposed to test the strong hypothesis that absolute
fundamental frequency heights contribute nothing to the identification of the
tones, while the shapes of the frequency contours carry all the information.

BACKGROUND WORK 

Much of the present work is a continuation of earlier work done by the author
(Abramson, 1962) with fewer informants and test subjects. That study showed
(p. 128) that sets of tonally differentiated monosyllabic words, as produced
by a single speaker, could be correctly identified nearly 100 percent of the
time. In addition, fundamental frequency (f0) measurements were taken from a
large sampling of monosyllabic words with both short and long vowels, yielding
average contours for the five tones (pp. 112-127).

These average f0 contours were then synthesized to see if Thai speakers could
indeed identify each of the tones on the basis of f0 alone. The synthesizer
used was the Haskins Laboratories' Intonator (Abramson, 1962:20), which
enabled the experimenter to analyze the spectrum of the speech signal and then
resynthesize it on the machine's own "voice" source with new f0 contours.
Thus, most of the phonetic features of the original utterance are kept even
when an f0 contour is imposed. The five average tonal curves were thus imposed
upon syllables that had originally carried the mid tone, namely /khaj/ 'dried
sweat' and /loo/ 'unstable' (in /loolee/). The two perception tests prepared
in this way exposed native Thai listeners to the tonal contours on both short
and long vowels. The perceptual labeling of the randomized stimuli of these
two tests showed clearly that the isolated f0 contours provided sufficient
cues for identifying the tones (pp. 131-132).

In another experimental condition, five monosyllabic words minimally
differentiated by tone were manipulated with the Intonator to yield 25 new
syllables. That is, five new syllables were derived from each spoken syllable
by removing the original tonal contour and replacing it with the synthetic
contours, one by one, on this syllable carrier. This was done to see whether
the curves carried enough information to override the effects of other
features found in association with the pitch movements of the tones. The
results (pp. 131-134) included a small number of confusions apparently
attributable to such concomitant features as variations in amplitude and
duration, which would be likely to survive the analysis and resynthesis,
although by and large the f0 curves were heard as intended. In general, then,
the data supported the conclusion that the f0 contours isolated by means of
acoustic analysis furnished sufficient cues by themselves for the
identification of the five tones of Central Thai.

The question of tonal homogeneity in the central area is being studied by
Dr. Udom Warotamasikkhadit.

It is gratifying to note that in data obtained some 14 years later (most of
the tonal data for the 1962 publication were collected before 1960) Erickson
(1974) provides general verification of the old contours while adding
important information on the perturbing effects of initial consonants.

Experiment 1: Isolated Monosyllabic Words 

Earlier work (Abramson, 1962:128) indicated that Thai listeners can identify
the tones of monosyllabic utterances nearly perfectly. Although four sets of
tonally differentiated words were used in that study, the productions of only
one speaker furnished the stimuli. R. B. Noss (personal communication) has
argued that generalizing from these results is, perhaps, not warranted and
that normally the mid and low tones, at least, are difficult to identify
unless they are embedded in a context. We thought that this objection might be
handled by replicating the experiment with a somewhat larger number of
speakers and listeners. In addition, data from responses to real speech were
needed to furnish a standard for the later evaluation of results obtained with
synthetic speech.

The following set of words was chosen for all the experiments to be described:

Tone       Thai Script    Transcription    Gloss
Mid        คา             /khaa/           'a grass (Imperata cylindrica)'
Low        ข่า             /khàa/           'galangal, a rhizome'
Falling    ข้า             /khâa/           'slave, servant'
High       ค้า             /kháa/           'to engage in trade'
Rising     ขา             /khǎa/           'leg'

The tone names are conventional but not fully descriptive. Ten native speakers
of Central Thai recorded three or four randomizations each of the list with
short pauses such that there were five tokens of each word in each
randomization. The speakers, five men and five women, included nine university
instructors and one clerk.

The tests were played to 25 native speakers of Central Thai through headphones
in the language laboratory (Tandberg Teaching System IS 6 with Beyer DT93B
Dynamic Headphones) of Ramkhamhaeng University, Bangkok, twice a week for a
month. Only one test order was used for each of the ten speakers. Here and in
later experiments the subjects were instructed to write the numerals 0, 1, 2,
3, or 4 for the mid, low, falling, high, and rising tones, respectively. These
numbers are appropriate to the nomenclature of traditional Thai grammar and
facilitated later scoring. Familiar as the subjects were with this convention,
they were nevertheless provided with a sheet at each session showing the five
words in Thai script with their numerical equivalents.

Concomitant features such as amplitude and duration can preserve the tonal
distinctions to a severely limited extent in whispered speech, in which no f0
is present (Abramson, 1972).

The results are shown in Table 1, which is arranged as a confusion matrix with
the tonal stimuli in the first column and the perceptual responses to them in
the next five columns. The last column (N) shows the total number of responses
to each stimulus word. A perfect response to the tones as intended would yield
100 percent in each cell along the diagonal from the upper left to the lower
right. The overall intelligibility of 98.6 percent, 85 errors out of 6100
responses, is high. Inspection of the responses to the mid and low tones
indicates that some confusion between them accounts for most of the small
number of errors.

TABLE 1: Confusion matrix of real-speech responses.

                              % Responses
Stimuli         Mid      Low  Falling     High   Rising        N
Mid            97.9      2.1                                1220
Low             3.4     96.6                                1220
Falling                  0.2     99.1      0.4      0.3     1220
High                                     100.0              1220
Rising                   0.1      0.4              99.5     1220

Total = 6100
Subjects = 25
% Correct = 98.6

The data were also examined for individual differences among speakers and 
listeners. One of the ten speakers caused 45.9 percent of the errors, another 
speaker, 16.5 percent, and a third speaker, 12.9 percent. The recordings of 
only one speaker produced no errors at all. As for intersubject differences,
the worst listener made 12.9 percent of the errors, followed by two who each 
made 8.2 percent of the errors. Three of the 25 subjects made no errors at all. 
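
The percentages reported for Experiment 1 can be turned back into counts as a
rough consistency check. The sketch below assumes only the figures stated in
the text (85 errors, 6100 responses, and the quoted shares of the errors);
rounding each share to the nearest whole error is our assumption, and the
function name is hypothetical.

```python
TOTAL_RESPONSES = 6100
TOTAL_ERRORS = 85

# Overall percent correct implied by the error count.
percent_correct = 100 * (TOTAL_RESPONSES - TOTAL_ERRORS) / TOTAL_RESPONSES


def errors_from_share(share_percent, total_errors=TOTAL_ERRORS):
    """Nearest whole number of errors implied by a reported share (assumed rounding)."""
    return round(share_percent * total_errors / 100)


# Shares of the 85 errors attributed to the three worst speakers and listeners.
speaker_shares = [45.9, 16.5, 12.9]
listener_shares = [12.9, 8.2, 8.2]

speaker_errors = [errors_from_share(s) for s in speaker_shares]
listener_errors = [errors_from_share(s) for s in listener_shares]

print(round(percent_correct, 1))
print(speaker_errors, listener_errors)
```

Under these assumptions the three speakers named account for roughly three
quarters of all the errors, which is in line with the observation that a few
speakers supply only minimal cues.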



This larger sampling of speakers and listeners supports the earlier conclusion
that Thai words minimally differentiated by tones can nearly always be
identified correctly even as isolated forms. Some speakers appear to provide
minimal perceptual cues for words out of context, especially for the contrast
between mid and low tones. Some listeners seem to require more than these
minimal cues for the identification task. Finally, the data in Table 1 provide
a baseline for the other experiments now to be discussed.

The capable and efficient selection and supervision of the test subjects by
Miss Panit Chotibut of the Faculty of Humanities, Ramkhamhaeng University, is
much appreciated.

SYNTHETIC SPEECH 

Experiment 2: Perception of f0 Contours

To test once again the perceptual efficacy of the f0 contours derived from
speech measurements in the earlier study (Abramson, 1962:112-127), a different
speech synthesizer, the Haskins Laboratories' formant synthesizer, was used
under control of a computer. For the present experiment, the parameters
specified were the frequency and amplitude values of the first three formants,
the timing of source functions for voicing and voicelessness, the overall
amplitude of the signal, and the fundamental frequency (f0). Steady-state
formant frequencies were chosen to yield a vowel acceptable as Thai /aa/;
formant transitions (Liberman, Delattre, Cooper, and Gerstman, 1954)
appropriate to the velar place of articulation were used, and voiceless
aspiration was simulated by providing a long voicing lag (Lisker and Abramson,
1964, 1970) filled with turbulent noise in the regions of the formants. The
overall amplitude was kept flat throughout the syllable except for a slight
rise at the beginning and a slight fall at the end. These specifications
yielded syllables of the type [kha:] which, it was hoped, with suitable f0
contours would be heard as the five Thai words listed in the section on
Isolated Monosyllabic Words. The five f0 contours of the 1962 study (Abramson,
1962:127, Fig. 3.6) were retained with slight adaptations required by the
nature of the computer program and imposed one by one upon this syllable. The
contours, shown in the upper part of Figure 1, covered a range from 92 Hz to
152 Hz, which was reasonable for an adult male voice. Three tokens of each
stimulus type thus produced were randomized into six test orders and played to
38 native speakers of Central Thai over a period of a month, together with
other tests in the same sessions.
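
For readers who think of pitch ranges in musical intervals, the 92-152 Hz
range can be restated in semitones. This conversion is a convenience added
here, not a measure used in the study; the function name is hypothetical and
the only inputs are the two frequencies given in the text.

```python
import math


def semitones(f_low_hz, f_high_hz):
    """Interval between two frequencies in equal-tempered semitones."""
    return 12 * math.log2(f_high_hz / f_low_hz)


# Lowest and highest f0 values of the synthetic contours, as stated in the text.
span = semitones(92.0, 152.0)
print(f"{span:.2f} semitones")
```

The result is a little under nine semitones, that is, somewhat less than an
octave, for the whole set of five contours.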

The results of these listening tests are shown in Table 2. Note that the tonal
names in the first column are written with quotation marks. This is meant to
convey that these f0 contours were intended as those tones but can be so
labeled only to the extent that the subjects accept them as such. In the same
spirit, the word "correct" at the bottom of the table is also printed with
quotation marks. Otherwise, the form of the confusion matrix is the same as
that of Table 1.

Figure 1: Upper part: f0 contours used in Experiment 2. Lower part: amplitude
contours used in Experiment 3 in conjunction with the f0 contours.

TABLE 2: Synthetic speech: responses to five contours.

                              % Responses
Stimuli         Mid      Low  Falling     High   Rising        N
"Mid"          82.0      3.5      0.2     14.2               906
"Low"           7.2     87.3      5.2      0.3               906
"Falling"       0.4      0.7     97.8      0.6      0.6      906
"High"          2.0               0.2     97.7      0.1      906
"Rising"        0.2      0.2      0.3      0.1     99.1      906

Total = 4530
Subjects = 38
% "Correct" = 92.8

Experiment 3: f0 Plus Amplitude

Changes in the contraction of certain laryngeal muscles and in subglottal air
pressure can separately or together produce variations in the fundamental
frequency of the voice. These mechanisms are also available for controlling
the intensity of phonation and thus variations in the overall amplitude of the
speech signal. To a certain extent, then, the two acoustic features, f0 and
amplitude, may covary. Since the major psychological correlate of amplitude or
intensity is loudness, just as that of f0 is pitch, it is not unreasonable to
suppose that the ear may detect shifts in loudness in conjunction with large
pitch excursions of one or more of the tones. If this is so, even though it
has already been demonstrated with synthetic speech that certain ideal f0
contours carry sufficient information for the identification of the five
tones, the perceptual processing of some of the tones in real speech may
include awareness of relative amplitude as a concomitant feature which, under
certain conditions, may actually contribute to tonal identification. In the
present study the question arises as to whether the discrepancies in
intelligibility between Experiments 1 and 2 will be removed simply by the
addition of appropriate amplitude contours.

A University of Connecticut doctoral dissertation by Donna Erickson, soon to
be completed, Laryngeal Mechanisms and Coarticulation Effects in the Tones of
Thai, explores the role of intrinsic and extrinsic laryngeal muscles in the
production of the tones of Thai.

To answer this question, we added the amplitude contours of the lower part 
of Figure 1 to the corresponding f0 contours of the upper part. Amplitude is
indicated in decibels (db) as decrements from the maximum output of the synthe- 
sizer at zero db. These amplitude contours are approximations derived from in- 
spection of a small sample of amplitude displays made with sound spectrograms of 
Thai words in isolation. The new synthetic stimuli were randomized five times 
each into three test orders and played to 40 native speakers of Central Thai 
over a period of a month, together with other tests. The results are shown in 
the confusion matrix of Table 3. 
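
Since the amplitude contours are specified as decibel decrements from the
synthesizer's maximum output, it may help to see what such decrements mean as
linear amplitude ratios. The sketch below assumes the usual amplitude
convention of 20 log10 of the ratio; the function name and the sample
decrements are illustrative and are not values read from Figure 1.

```python
def db_decrement_to_ratio(decrement_db):
    """Linear amplitude ratio for a level `decrement_db` below the 0 db reference.

    Assumes the amplitude (not power) convention: db = 20 * log10(ratio).
    """
    return 10 ** (-decrement_db / 20)


# Illustrative decrements only; the actual contour values are not given in the text.
for decrement in (0, 6, 10, 20):
    print(decrement, round(db_decrement_to_ratio(decrement), 3))
```

On this convention a decrement of about 6 db halves the amplitude and a
decrement of 20 db reduces it to one tenth of the maximum.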

TABLE 3: Synthetic speech: responses to f0 plus amplitude.

                              % Responses
Stimuli         Mid      Low  Falling     High   Rising        N
"Mid"          95.7      1.0      0.2      3.0      0.1     1555
"Low"           7.3     91.2      1.2      0.1      0.2     1555
"Falling"       0.1      0.5     96.6      2.5      0.3     1555
"High"          0.2      0.2      0.8     98.5      0.4     1555
"Rising"        0.2      0.4      0.5      0.2     98.7     1555

Total = 7775
Subjects = 40
% "Correct" = 96.1

Comparison of the Three Conditions

Inspection of Tables 1-3 shows that the overall identifiability of the stimuli
moves from 98.6 percent for real speech through 92.8 percent for fundamental
frequency alone to 96.1 percent for f0 plus amplitude, thus suggesting that
while f0 alone is by and large a sufficient cue, its efficacy is enhanced by
the addition of amplitude information. A comparison of the correct mean scores
in the diagonals across the matrices is more to the point than just the
overall scores. As we move from real speech (Table 1) to f0 alone (Table 2),
the most striking changes are in the cells for the mid and low tones, which
lose, respectively, 15.9 percent and 9.3 percent. In Table 3, for f0 combined
with amplitude information, the entries in the corresponding two cells move
back in the direction of real speech, although the improvement is considerably
greater for the mid tone.
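
The comparison of diagonal cells described above can be laid out explicitly.
This is a minimal sketch using the diagonal entries of Tables 1-3; the
dictionary layout and names are ours, and the real-speech mid entry is read as
97.9, which is the value implied by the reported 15.9-point loss.

```python
TONES = ["mid", "low", "falling", "high", "rising"]

# Percent "correct" (diagonal cells) in each condition, from Tables 1-3.
diagonals = {
    "real speech": [97.9, 96.6, 99.1, 100.0, 99.5],
    "f0 alone": [82.0, 87.3, 97.8, 97.7, 99.1],
    "f0 + amplitude": [95.7, 91.2, 96.6, 98.5, 98.7],
}


def mean(values):
    return sum(values) / len(values)


# Each stimulus has the same N within a table, so the mean of the diagonal
# equals the overall percent correct for that condition.
overall = {name: round(mean(vals), 1) for name, vals in diagonals.items()}

# Points lost per tone when moving from real speech to f0 alone.
loss_f0_alone = {
    tone: round(real - synth, 1)
    for tone, real, synth in zip(TONES, diagonals["real speech"],
                                 diagonals["f0 alone"])
}

print(overall)
print(loss_f0_alone)
```

The same subtraction against the "f0 + amplitude" row shows how much of each
loss is recovered when amplitude contours are added.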

Turning to the confusions in the matrices, we note a much greater scattering
of errors in the two synthetic speech experiments as compared with real
speech. Specifically, in Table 2 there is some confusion between the mid and
high tones. The intended mid tone is called high 14.2 percent of the time.
There is also some confusion in the other direction, high heard as mid, but
only 2 percent of the time. Table 3 shows that the addition of amplitude
information eliminates most of the confusion between these two tones. The only
notable confusion in the real-speech test of Table 1 is between the mid and
low tones. This confusion is even worse in Table 2. Under both conditions the
hearing of the intended low tone as mid accounts for most of the confusion.
This latter effect is not eliminated by the addition of amplitude information
in Experiment 3, although the identification of the intended mid tone itself
is now improved to the level of real speech or slightly better. The intended
low tone is heard as falling 5.2 percent of the time in Table 2 but is nearly
back to the level of real speech in Table 3. One small but puzzling distortion
apparently caused by the addition of amplitude information is the hearing of
the intended falling tone as high 2.5 percent of the time in Table 3.

An experiment not performed as part of this research would be to try amplitude
contours alone and then supplement them with f0 information. Previous research
(Abramson, 1972) implied that amplitude alone would be not nearly sufficient
for perception of the tones.



It seems safe to infer from the results of the preceding experiments that f0
contours carry most of the information for tonal identification in Thai; that
is, they carry sufficient information most of the time to identify words that
are minimally distinguished by tone. A concomitant feature of some relevance
for at least part of the tone system is contained in changes in the overall
amplitude of the utterance. The confusions in the various matrices indicate
the need for further information to improve the synthesis of Thai tones by
rule. The improvements needed may be small refinements of the f0 contours and
better amplitude specification. In addition, simulating glottal tension in the
voice source might make the high tone more natural and acceptable in
utterance-final position.

Experiment 4: Perception of f0 Levels

The five tones of Central Thai can be viewed as falling into two groups, the
dynamic tones and the static tones (Abramson, 1962:9-11). In this scheme, the
sharp upward and downward movements of the rising and falling tones place them
in the dynamic category. Since the high, mid, and low tones often sound as if
they simply occupy three levels, they are classified as static. Of course, the
acoustical measurements, as reflected in the upper part of Figure 1, show that
even the static tones undergo some f0 movement. Kenneth Pike (1948:5) speaks
of level tonemes and gliding tonemes: "a LEVEL toneme is one in which, within
the limits of perception, the pitch of the syllable does not rise or fall
during its production. A GLIDING toneme is one in which during the
pronunciation of the syllable in which it occurs there is a perceptible rise
or fall, or some combination of rise and fall, such as rising-falling or
falling-rising." It may be the case, then, that speakers of the language, not
to mention field phoneticians, hear the static tones of Thai as simple levels.
A lengthy quotation from Pike (1948:4) is of considerable interest here:

     Tone languages have a major characteristic in common: It is the
     relative height of their tonemes, not their actual pitch, which is
     pertinent to their linguistic analysis. It is immaterial to know the
     number of vibrations per second of a certain syllable. The important
     feature is the relative height of a syllable in relation to preceding
     and following syllables. It is even immaterial, on this level of
     analysis (but not in the analysis of the linguistic expression of
     emotion), to know the height of a specific syllable in proportion to
     the general average pitch which the speaker uses. Rather, one must
     know the relationship of one specific syllable to the other syllables
     in the specific context in the particular utterance. A man and a
     woman may both use the same tonemes, even though they speak on
     different general levels of pitch. Either of them may retain the same
     tonemes while lowering or raising the voice in general, since it is
     the relative pitch of syllables within the immediate context that
     constitutes the essence of tonemic contrast.

Ilse Lehiste (1970:Chap. 3) provides a useful survey of these matters and
related questions.

The effects set forth in the comparison of Tables 1-3 seem obvious from simple
inspection of the tables. Indeed, statistical analysis (t-tests for
differences between correlated means) performed with the kind help of Dr. Lyle
Bachman of the Central Institute of English Language and the Ford Foundation,
Bangkok, shows them by and large to be significant at the 5 percent level of
confidence or better. More refined statistical procedures might lead to
further observations, but such effects would probably be so subtle as to be
uninteresting for our understanding of the perception of tones.

How likely is it that absolute values of f0 levels provide sufficient cues for
the recognition of the tones of a given speaker of a tone language in a
certain context? The relativity of the pitches of tones (Cook, 1972) is
usually taken for granted. Indeed, the few studies that have yielded acoustic
measurements of tones, e.g., for Mandarin Chinese (Howie, 1972) and Thai
(Abramson, 1962; Erickson, 1974), show that the tones are characterized by at
least some movement and, in some cases, much movement. That is, they tend not
to be uttered with a flat, unchanging f0. The one tone in Thai that appears
most likely to have a flat f0 in certain nonfinal environments is the mid
tone. In an interesting experiment, Victor Zue (discussed in Klatt, 1973)
demonstrated that Howie's Mandarin contours still showed rather high
intelligibility even when the range of absolute f0 was severely compressed,
thus indicating that the pitch movement still available to the listeners
carried sufficient information.

Of course, most of the foregoing observations are derived from tones in 
isolated words or at least in very short utterances. It seems likely that at 
least for some of the tones of Thai, presumably the high, mid, and low, the 
perturbations of their f0 contours occasioned by the many coarticulations of 
running speech should produce flat variants here and there (Abramson, in 
preparation). If so, are such words understood by virtue of contextual 
redundancy or, to return to the question raised in the preceding paragraph, do 
the absolute levels furnish sufficient cues? That is, are "level" pitches 
assigned to tones only when they are the perceptual responses to small f0 
movements, or will true acoustic levels suffice? Experiment 4 was designed to 
provide some answers to the question. 

As shown in Figure 1, the "voice" of the synthesizer in the experiments 
reported so far was set to range from 152 Hz down to 92 Hz. In this experiment 
the range was divided to produce 16 flat fundamental frequencies in steps of 
4 Hz; the amplitudes were flat too, except for a slight rise at the beginning and 
a slight fall at the end. These variants were imposed upon the same basic 
syllable, randomized, and played to 37 native speakers of Thai for identification 
as members of the same set of five words as before. Table 4 reveals that the 
falling and rising tones, which are characterized by very dynamic movements, 
elicited practically no responses. Indeed, the 0.1 percent response to the 
100 Hz level as the rising tone is so improbable as to suggest momentary 
inattention. From top to bottom, we have here a gradual crossover from the high 



An important background article for the phonological treatment of tone is 
Wang (1967). 
TABLE 4: Synthetic speech: responses to 16 level f0's. 

                   % Responses 

 Hz    Mid    Low  Falling   High  Rising      N 
152    8.0    4.1      0.2   87.7            903 
148    7.9    4.2            87.6            903 
144    8.6    4.1      0.2   87.0            903 
140   12.1    4.3      0.1   83.5            903 
136   18.6    5.4      0.1   75.9            903 
132   29.2    5.4            65.3            903 
128   49.3    6.2            44.5            903 
124   65.3    5.5      0.1   29.0            903 
120   72.6    6.3      0.1   20.9            903 
116   73.0   12.2            14.7            903 
112   66.4   19.7            13.8            903 
108   42.1   45.7            12.2            903 
104   18.4   73.8             7.9            903 
100   11.0   81.5      0.3    7.1     0.1    903 
 96    5.5   88.7      0.1    5.6            903 
 92    4.8   90.1             5.1            903 

Total = 14,448 
Subjects = 37 
tone through the mid tone to the low tone. Nowhere is 100 percent identification 
as a particular tone achieved. The closest is a peak of 90.1 percent for 
the low tone, which is comparable with the peak of 87.3 percent for the low tone 
in Experiment 2 (Table 2). The other two peaks here, 73 percent for the mid 
tone and 87.7 percent for the high tone, compare somewhat less well with their 
counterparts, 82 percent and 97.7 percent respectively, among the ideal contours 
of Experiment 2. Despite these peaks, it is important to note that all three 
tones persist in eliciting responses throughout their ranges. Most of this is 
accounted for not by the sporadic responses of all the subjects but rather by the 
deviant response behavior of three of them. One subject agreed with the main 
group in calling the upper part of the range the high tone, but at 120 Hz she 
started crossing over to the low tone and remained there the rest of the way 
down with only scattered responses in the mid-tone column. The second subject 
of the three called the upper part of the range the mid tone and crossed over to 
the low tone at 108 Hz. The third subject deviated in the most surprising way 
from the performance of the main group: she assigned the upper part of the 
range to the low tone and, crossing over at about 120 Hz, the rest of the range 
to the high tone! (This subject's correct use of labeling conventions in other 
tests taken during the same sessions shows that she is not guilty of misuse of 
labels here.) Apparently, however, her psychological set shifts from time to 
time, because in some of the test sessions she assigned variations in the lower 
part of the range to the mid tone. 
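
As a rough illustration of how the peaks and crossovers just described can be read off Table 4, the following sketch recomputes them from the tabulated percentages. It is not part of the original analysis: the variable and function names are ours, and only the mid, low, and high columns are entered, since the falling and rising columns are nearly empty.

```python
# Level f0 values used in Experiment 4: 152 Hz down to 92 Hz in 4-Hz steps.
levels_hz = list(range(152, 88, -4))

# Percent responses copied from Table 4 (mid, low, and high columns only).
mid = [8.0, 7.9, 8.6, 12.1, 18.6, 29.2, 49.3, 65.3,
       72.6, 73.0, 66.4, 42.1, 18.4, 11.0, 5.5, 4.8]
low = [4.1, 4.2, 4.1, 4.3, 5.4, 5.4, 6.2, 5.5,
       6.3, 12.2, 19.7, 45.7, 73.8, 81.5, 88.7, 90.1]
high = [87.7, 87.6, 87.0, 83.5, 75.9, 65.3, 44.5, 29.0,
        20.9, 14.7, 13.8, 12.2, 7.9, 7.1, 5.6, 5.1]
responses = {"mid": mid, "low": low, "high": high}


def peak(tone):
    """Return (percent, Hz) at which a tone's response rate is highest."""
    best = max(range(len(levels_hz)), key=lambda i: responses[tone][i])
    return responses[tone][best], levels_hz[best]


def dominant(i):
    """Return the tone label with the most responses at level index i."""
    return max(responses, key=lambda tone: responses[tone][i])


# Levels at which the most frequent label differs from that of the level above.
crossovers = [(levels_hz[i], dominant(i - 1), dominant(i))
              for i in range(1, len(levels_hz))
              if dominant(i) != dominant(i - 1)]

for tone in ("high", "mid", "low"):
    print(tone, peak(tone))
print(crossovers)
```

On the pooled data the most frequent label changes only twice going down the range, which is one way of stating the gradual high-to-mid-to-low crossover; the sketch says nothing about the three deviant subjects, whose individual responses are not tabulated.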

We may infer from the results of Experiment 4 that even in isolated 
monosyllabic words unchanging levels of fundamental frequency can carry 
considerable information as to the identity of the high, mid, and low tones. We 
suppose that in such a situation some accommodation to the speaker's pitch range 
is necessary. The subjects who took these tests were quite used to the "voice" 
of the synthesizer, and care was taken to confine the absolute levels of f0 to 
the range already in use in other tonal experiments. At the same time, the fact 
that there is no gliding movement at all in the stimuli seems to cause a certain 
amount of confusion across the three categories that the subjects accepted. 
Indeed, for three of the 37 subjects this factor was perceptually very disrupting. 
They may represent a population of Thai speakers for whom the small gliding 
movements normally found in productions of the static tones in isolation are 
essential for correct identification. In the absence of any f0 movement at all, 
it is not surprising that the falling and rising tones were not used as response 
categories. 

Conclusion 

The experiments described here lead to a few general conclusions about some 
aspects of the perception of the tones of Thai. First of all, it is more evident 
than heretofore that the intelligibility of tonally differentiated monosyllables 
presented in isolation is quite high. In addition, the average fundamental 
frequency contours obtained some years ago (Abramson, 1961, 1962), when 
applied as instructions to the parameters of a different synthesizer (and 
therefore presumably other synthesizers), still carry enough information for 
acceptable synthesis of Thai words. A slight reduction of the discrepancy in 
intelligibility between these contours and those of real speech can be effected by 
adding rough approximations of natural amplitude movements often found in 
correlation with shifts in fundamental frequency. To eliminate the small remaining 
discrepancy further work is needed. Finally, a continuum of level fundamental 
frequencies can be divided perceptually by native speakers of Thai into the 
high, mid, and low tones, but with very gradual transitions and with rather 
aberrant response patterns on the part of some test subjects; these data suggest 
that while levels, even in isolated syllables, can carry much information 
about these tones, there is still a fair amount of interference from the abnormal 
lack of change in fundamental frequency. 

Given the paucity of similar perceptual data for other tone languages, it 
is hard to say how we may generalize these findings beyond the Thai language. 
In fact, even within Thai there is little information about other major regional 
dialects. ^2 Such phonetic features as "creaky voice" or "glottal tension" 
prominent in some tonal systems will probably require parameters other than simple 
control of fundamental frequency or amplitude for experimental investigation. 
Glottal tension is found in the high tone of Central Thai but appears to be un- 
stable. Reports on more complicated manipulations of fundamental frequency to 
elucidate further the nature of tonal perception in Thai will be forthcoming. 
Phonetic Segmentation and Recoding in the Beginning Reader* 

Isabelle Y. Liberman, Donald Shankweiler, Alvin M. Liberman, Carol Fowler, 
and F. William Fischer 



The beginning reader, the child of six or thereabouts, is an accomplished 
speaker-hearer of his language and has been for a year or more. Why, then, 
should he find it hard to read, as so many children do? Why does he not learn 
to read as naturally and inevitably as he learned to speak and listen? What 
other abilities, not required for mastery of speech, must he have if he is to 
cope with language in its written form? 

If the beginning reader is to take greatest advantage of an alphabet and of 
the language processes he already has, he must convert print to speech or, more 
covertly, to the phonetic structure that, in some neurological form, must be 
presumed to underlie and control overt speech articulation. In the first part 
of the paper we will say why it might be hard to make the conversion properly, 
that is, so as to gain all the advantages that an alphabetic system offers. But 
the conversion from print to speech, whether properly made or not, may also be 
important to the child in reducing what is read to a meaningful message. This 
is so because of a basic characteristic of language: the meaning of the longer 
segments (for example, sentences) transcends the meaning of the shorter segments 
(for example, words) out of which they are formed. From that it follows that 
the shorter segments must be held in some short-term store until the meaning of 
the longer segments has been computed. In the second part of the paper we will 
consider the possibility that a phonetic representation is particularly suited 
to that requirement. 

In referring to the conversion of print to speech, which is what much of 
this paper is about, we will not be especially concerned to make a distinction 
between overt speech and the covert neurological processes (isomorphic, 
presumably, with the phonetic representation) that govern its production and 
perception. We should only note that the beginning reader often converts to overt 
speech and the skilled reader to some more covert form. We should also note 
that the conversion to the covert form does not, of course, limit the reader to 
the relatively slow rates at which he can overtly articulate. We will also not 
be concerned with the distinction between the phonetic and the more abstract 
phonological representations. Like many alphabetically written languages, 
English makes contact, not at the phonetic level, but at some more abstract 
remove, closer surely to the level of systematic phonologic structure (Chomsky, 
1970; Klima, 1972) or, in the older terminology, to the phonemic and 
morphophonemic levels. That is an important consideration to students of the 
reading process, but it happens not to be especially relevant to our purposes in 
this paper. For convenience, then, we will speak of phonemes, phonetic segments, 
and phonetic structure without implying any differences in the abstractness of 
the units being referred to. 

USING THE ALPHABET TO FULL ADVANTAGE 
The need to segment phonetically 

For the moment we will concern ourselves only with the first problem: what 
a child needs in order to read an alphabetic language properly. In that connec- 
tion, let us look at the strategies the beginning reader might use to recover a 
phonetic representation of the written word. There are at least two 
possibilities: the child might work analytically, by first relating the orthographic 
components of the written word to the segmental structure of the spoken word, or 
he might do it holistically, as in the whole-word method, by simply associating 
the overall shape of the written word with the appropriate spoken word. In the 
whole-word strategy, the child not only does not need to analyze words into 
their phonetic components, but need not necessarily even be aware that such an 
analysis can be made. There are, however, many problems with this strategy. An 
obvious one, of course, is that it is self-limiting; it does not permit the 
child to take advantage of the fact that his language is written alphabetically. 
In the whole-word strategy, each new word must be learned as a unit, as if it 
were an ideographic character, before it can be read. Only if the child uses 
the more analytic strategy can he realize the important advantages of an alpha- 
betically written language. Thus, given a word which he has heard or which is 
already in his lexicon, the child can read it without specific instruction, 
though he has never before seen it in print; or, given a new word which he has 
never before heard or seen, the child can closely approximate its spoken form 
and hold that until its meaning can be inferred from the context or discovered 
later by asking someone about it. In connection with the latter advantage, one 
might ask why the child cannot similarly hold the word in visual form. Perhaps 
he can. We know, however, that the spoken form can be retained quite easily and, 
indeed, that it can readily be called up. As to what can be done with a purely 
visual representation, we are not so sure. At all events, and as we will say at 
greater length later, spoken language, or its underlying and covert phonetic 
representation, seems particularly suited for storage of the short-term variety. 

What special ability does the child need, then, if he is to employ the 
analytic strategy and thus take full advantage of the alphabetic way our lan- 
guage is written? In our view, it is the ability to make explicit the phonetic 
segmentation of his own speech. Consider, for example, what is involved in 
reading a simple word like bag. Let us assume that the child can identify the 
three letters of the word, and further, that he knows the individual 
letter-to-sound correspondences: the sound of b is /bʌ/, the sound of a is /æ/, 
and g is /gʌ/. If that is all he knows, however, he will sound out the word as 
buhaguh, a nonsense trisyllable containing five phonetic segments, and not as 
bag, a meaningful monosyllable with only three phonetic segments. If he is to 
map the printed, three-letter word bag onto the spoken word bag, which is already 
in his lexicon, he must know that the spoken syllable also has three segments. 
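
The mismatch in segment counts can be made concrete with a small sketch. The letter-to-sound table below is a hypothetical one covering only the example word, and the function name is ours; it simply strings together the sound a child might give each letter in isolation.

```python
# Hypothetical letter-to-sound table for the example word "bag".
# Each consonant, sounded in isolation, drags along a vowel.
letter_sounds = {"b": ["b", "ʌ"], "a": ["æ"], "g": ["g", "ʌ"]}


def sound_out(word):
    """Concatenate the isolated sound of each letter, one letter at a time."""
    segments = []
    for letter in word:
        segments.extend(letter_sounds[letter])
    return segments


sounded = sound_out("bag")   # the "buhaguh" pronunciation
spoken = ["b", "æ", "g"]     # segments of the spoken word bag

print(len(sounded), len(spoken))
```

Letter-by-letter sounding yields five segments where the spoken word has three, so the child cannot arrive at bag without knowing how the spoken syllable itself is segmented.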

The difficulties of making phonetic segmentation explicit 

Given that the child must be able to make explicit the phonetic segmentation 
of the word, is there any reason to believe that he might encounter 
difficulties? There is, indeed, and it comes directly from research on acoustic 
cues for speech perception: the finding that there is, most commonly, no acoustic 
criterion by which the phonetic segmentation of a given word is dependably 
marked (Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967). Phoneme 
boundaries are not marked acoustically because the segments of the phonetic 
message are often coarticulated, with the result, for example, that a consonant 
segment will, at the acoustic level, be encoded into (that is, merged with) the 
vowel. The word bag, for example, has three phonetic segments but only one 
acoustic segment. Thus, there is no acoustic criterion by which one can segment 
the word into its three constituent phonemes. Analyzing an utterance into 
syllables, on the other hand, may present a different and easier problem. We 
should expect that to be so because every syllable contains a vocalic nucleus 
and thus will have, in most cases, a distinctive peak of acoustic energy. These 
energy peaks provide audible cues that correspond approximately to the syllable 
centers (Fletcher, 1929). Though such auditory cues could not in themselves 
help a listener to define exact syllable boundaries, they ought to make it 
relatively easy for him to discover how many syllables there are and, in that 
sense, to do explicit syllable segmentation. 

We should remark here that the analytic strategy we have been talking about 
does not mean reading letter by letter. Indeed, if the child is using the 
analytic strategy he most certainly cannot read that way. Sounding out the letters 
would produce nonsense, as in the example of buhaguh (for bag) offered above. 
Given the way phonetic segments are encoded or merged at the level of sound, the 
spoken form can be recovered only if, before making the conversion, the reader 
takes into account all the letters that represent the several phonetic segments 
to be encoded. In the example of bag, the coding unit is obviously the syllable. 
But coding influences sometimes extend across syllables, and in the case of 
prosody such influences may cover quite long stretches. We think, therefore, that 
the number of letters that must be apprehended before attempting to recover the 
spoken form may sometimes be quite large. In fact, we do not now know exactly 
how large these coding units are, only that they almost always exceed one letter 
in length. To identify such units is, in our view, a research undertaking of 
great importance and correspondingly great difficulty. 

It should also be emphasized here that the child who finds it difficult to 
make explicit the phonetic segmentation of his speech need not have any problems 
at all in the regular course of speaking and listening. Children generally 
distinguish (or identify) words like bad or bag, which differ in only one phonetic 
segment. Indeed, there is evidence now that infants at one month of age 
discriminate ba from pa (and da from ta) and, moreover, that they make this 
discrimination categorically, just as adults do (Eimas, Siqueland, Jusczyk, and 
Vigorito, 1971). The child has no difficulty in speaking and listening to speech 
because there the segmentation of the largely continuous acoustic signal is done 
for him automatically by operations of which he is not conscious. In order to 
speak and listen, therefore, he need have no more conscious awareness of phonetic 
structure than he has of syntactic structure. We all know that the child can 
speak a grammatical sentence without being able to verbalize the rules he is 
using to form that sentence. Similarly, he can readily distinguish bad from bag 
without being able to analyze the phonetic structure underlying the distinction, 
that is, without an explicit understanding of the fact that each of these 
utterances consists of three segments and that the difference lies wholly in the 
third. But reading, unlike speech, does require an explicit analysis if the 
advantages of an alphabet are to be realized. 

That explicit phonetic analysis might be difficult is suggested also by the 
history of writing (Gelb, 1963). In the very earliest systems the segment that 
the orthography represented was the word. Present-day approximations to that 
kind of writing are to be found in Chinese characters and in the very similar 
kanji that the Japanese use. Writing with meaningless units is a more recent 
development, the segment size represented in all the earliest forms being the 
syllable. An alphabet, representing the shortest meaningless segments (phones 
or phonemes), developed still later and apparently out of a syllabary. Moreover, 
all the other systems, whether comprising meaningful or meaningless units, and of 
whatever size, seem to have appeared independently in various places and at 
various times, but all the alphabets are considered to have been derived from a 
single original invention. It seems reasonable to suppose that the historical 
development of writing systems (from word, to syllable, to phoneme) might reflect 
the ease or difficulty of explicitly carrying out the particular type of 
segmentation that each of these orthographies requires. More to the point of our 
present concerns, one would suppose that for the child there might be the same 
order of difficulty, and, correspondingly, the same order of appearance in 
development. 

Development of the ability to analyze speech into phonemes and syllables 

We thus have reason to suppose that phonetic segmentation might be a difficult 
task, more difficult than syllabic segmentation, and that the ability to do 
it might, therefore, develop later. To test that supposition directly, we 
recently conducted an experiment. The point was to determine how well children in 
nursery school, kindergarten, and first grade (four-, five-, and six-year-olds) 
can identify the number of phonetic segments in spoken utterances and how this 
compares with their ability to deal similarly with syllables (Liberman, 
Shankweiler, Fischer, and Carter, 1974). The procedure was in the form of a 
game which required the child to indicate, by tapping a wooden dowel on the 
table, the number (from one to three) of segments (phonemes in the case of one 
group, syllables in the other) in a list of test words. To teach the child what 
was expected of him, the test list was preceded by a series of training trials 
in which the experimenter demonstrated how the child was to respond. The test 
itself consisted of 42 randomly assorted individual items of one, two, or three 
segments, presented without prior demonstration and corrected, as needed, 
immediately after the child's response. Testing was continued through all 42 
items or until the child reached a criterion of tapping six consecutive trials 
correctly without demonstration. The children of each grade level were divided 
into two experimental groups, the one requiring phoneme segmentation and the 
other, syllable segmentation. Instructions given the two groups were identical, 
except that the training and test items required phoneme segmentation in one 
group and syllable segmentation in the other. 
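
The stopping rule of the tapping game can be sketched as follows. This is only an illustration of the criterion as described above (six consecutive correct items, at most 42 items); the function name, its arguments, and the example response sequence are ours, not the experimenters' materials.

```python
def run_session(correct, criterion=6, max_items=42):
    """Score one child's session under the stopping rule described above.

    `correct` is a sequence of booleans, one per test item in order.
    Testing stops when `criterion` consecutive items are correct or when
    `max_items` items have been given, whichever comes first.
    Returns (passed, items_given, errors).
    """
    streak = 0
    errors = 0
    items_given = 0
    for is_correct in correct[:max_items]:
        items_given += 1
        if is_correct:
            streak += 1
            if streak == criterion:
                return True, items_given, errors
        else:
            streak = 0   # an error breaks the run of correct items
            errors += 1
    return False, items_given, errors


# A hypothetical child who errs on the third item, then taps six items correctly.
example = [True, True, False] + [True] * 6
print(run_session(example))
```

Under this rule the minimum number of trials to criterion is six, reached only by a child who makes no errors at all, and a child who never achieves a run of six is scored as failing after all 42 items.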

The results showed in more than one way that the test words were more readily 
segmented into syllables than into phonemes. At all grade levels, the number 
of children who were able to reach criterion was markedly greater in the group 
required to segment by syllable than in the group required to segment by phoneme. 
At age four, none of the children could segment by phoneme, whereas nearly half 
could segment by syllable. Ability to carry out phoneme segmentation success- 
fully did not appear until age five, and then it was demonstrated by only 17 per- 
cent of the children. In contrast, almost half of the children at that age could 
segment syllabically. Even at age six, only 70 percent succeeded in phoneme seg- 
mentation, while 90 percent were successful in the syllable task. 

The proportion of children at each age who reached criterion level in the 
minimum number of trials is another measure of the contrast in difficulty of the 
two tasks. For the children who worked at the syllable task, the percentage 
reaching criterion in the minimum time increased steadily over the three age 
levels: 7 percent at age four, 16 percent at age five, and 50 percent at age 
six. By contrast, we find in the phoneme group that no child at any grade level 
attained the criterion in the minimum time. 

The data were also analyzed in terms of mean errors. In Figure 1 mean 
errors to passing or failing a criterion of six consecutive correct trials 
without demonstration are plotted by task and grade. Errors on both the syllable 
and phoneme tasks decreased monotonically at successive grade levels, but the 
greater difficulty of phoneme segmentation at every level was again clearly 
demonstrated. 

Figure 1: Mean number of errors to passing or failing a criterion of six 
consecutive trials without demonstration in phoneme and syllable 
segmentation. 



Segmentation and reading 

The difficulty of phonetic segmentation has also been remarked by a number 
of other investigators (Rosner and Simon, 1970; Calfee, Chapman, and Venezky, 
1972; Savin, 1972; Gleitman and Rozin, 1973; Elkonin, 1973; Gibson and Levin, in 
press). Their observations, together with ours described in the experiment 
above, also imply a connection between phonetic segmentation ability and early 
reading acquisition. This relationship is suggested in our experiment by the 
increase in number of children passing the phoneme-counting task, from only 17 
percent at age five to 70 percent at age six. Unfortunately, the nature of the 
connection is in doubt. On the one hand, the increase in ability to segment 
phonetically might result from the reading instruction that begins between five 
and six. Alternatively, it might be a manifestation of some kind of intellectual 
maturation. The latter possibility might be tested by a developmental study of 
segmentation skills in a language community such as the Chinese, where the 
orthographic unit is the word and where reading instruction therefore does not 
demand the kind of phonetic analysis needed in an alphabetic system.^ 

In any event, since explicit phoneme segmentation is harder for the young 
child and develops later than syllable segmentation, one would expect that 
syllable-based writing systems would be easier to learn to read than those based 
on an alphabet. We may thus have an explanation for the assertion (Makita, 1968) 
that the Japanese kana, roughly a syllabary, is readily mastered by first-grade 
children. One might expect, furthermore, that an orthography which represents each 
word with a different character (as is the case in Chinese logographs and in the 
closely related Japanese kanji) would obviate the difficulties in initial 
learning that arise in mastering an alphabetic system. The relative ease with which 
reading-disabled children learn kanji-like representations of language while 
being unable to break the alphabetic code (Rozin, Poritsky, and Sotsky, 1971) may 
be cited here as evidence of the special burden imposed by an alphabetic script. 

However, we need not go so far afield to collect indirect evidence that the 
difficulties of phoneme segmentation may be related to early reading acquisition. 
Such a relation can be inferred from the observation that children who are 
resistant to early reading instruction have problems even with spoken language when 
they are required to perform tasks demanding some rather explicit understanding 
of phonetic structure. Such children are reported (Monroe, 1932; Savin, 1972) to 
be deficient in rhyming, in recognizing that two different monosyllables may 
share the same first (or last) phoneme segment, and also in speaking Pig Latin, 
which demands a deliberate shift of the initial consonant segment of the word to 
initial position in a nonsense syllable added to the end of the word. 
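
The Pig Latin transformation shows how explicit the required analysis is. The sketch below is a simplification under stated assumptions: it works on letters as a stand-in for phoneme segments, handles only a single initial consonant, and uses "ay" as the added nonsense vowel; the function name is ours.

```python
VOWELS = set("aeiou")


def pig_latin(word, added_vowel="ay"):
    """Shift a single initial consonant to the start of an added final syllable."""
    if not word or word[0].lower() in VOWELS:
        return word  # no initial consonant to shift; left unchanged in this sketch
    return word[1:] + word[0] + added_vowel


print(pig_latin("bag"))
print(pig_latin("pig"))
```

To carry out even this simplified version, the speaker must first isolate the initial consonant from the rest of the syllable, which is exactly the kind of explicit segmentation at issue.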

We, too, have explored directly, if in a preliminary way, the relation 
between ability to segment phonemes and reading. For that purpose we measured the 
reading achievement of the children who had taken part in our experiment on 



Unfortunately, such a test will be hard to make. Children in the People's 
Republic of China are taught to read alphabetically before beginning their study 
of logographic characters. 
phonetic segmentation, described above. Testing at the beginning of the second 
school year, we found that half of the children in the lowest third of the class 
in reading achievement (as measured by the word-recognition task of the Wide 
Range Achievement Test) had failed the phoneme segmentation task the previous 
June; on the other hand, there had been no failures in phoneme segmentation 
among the children who scored in the top third in reading ability (I. Liberman, 
1973). 

Data from the analysis of children's reading errors may also be cited as 
additional evidence for the view that explicit phoneme segmentation may be a 
serious roadblock to reading acquisition. If a chief source of reading 
difficulty is that the child cannot make explicit the phonetic structure of the 
language, he might be expected to show success with the initial letter (which 
requires no further analysis of the syllable) and relatively poor performance 
beyond that point. If he knows some letter-to-sound correspondences, and knows 
that he must scan in a left-to-right direction, he might simply search his 
lexicon for a word, any word, beginning with a phoneme that matches the initial 
letter. Thus, presented with the word bag, he might give the response butterfly. 
Such a response could not occur if the child were searching his lexicon for a 
word with three sound segments corresponding to the letter segments in the 
printed word. If, however, the child is unaware that words in his lexicon have 
a phonetic structure or if he has difficulty in determining what that structure 
is, then he will not be able to map the letters to the segments in these words. 
On these grounds, we would expect that in reading words such a child would make 
more errors on final consonants than on initial consonants. We have observed 
just this error pattern in a number of beginning and disabled readers aged seven 
to eleven (Shankweiler and Liberman, 1972; Liberman, 1973). 

Further evidence comes from a recently completed study (Fowler, Liberman, 
and Shankweiler, in preparation) which showed that although the error rate in 
reading decreases markedly with grade level, the position effect (i.e., the 
discrepancy in error rate between initial and final consonants) is maintained as 
the child progresses through the early grades. The subjects in this study were 
second, third, and fourth grade children. The list of words to be read consisted 
of 38 monosyllables selected to give equal representation to the 19 consonant 
phonemes that can occur in both initial and final position in English words. 
Each phoneme was represented twice in the list in each position. The words were 
presented to the child singly to be read aloud to the best of his ability. 

Analysis of the data shows final consonant errors to be at least twice as
frequent as initial. At Grade 2, the rate of initial consonant errors (IC) was
8 percent as compared with 16 percent for final consonants (FC); at Grade 3, IC
was 5 percent, FC 10 percent; at Grade 4, IC was 2 percent, FC 6 percent. It
was clear that the FC-IC difference could not be accounted for by differences in
the phonetic complexity of the consonants that tend to occur in initial and final
position, because the consonant phonemes in the test list were controlled for
frequency of occurrence and position in the word. But what about orthographic
complexity? It was possible that the FC-IC difference might be due to the fact
that a given phoneme occurring finally is spelled more complexly than that same
phoneme in the initial position (g and j versus dge and ge). We therefore looked
only at the errors on phonemes that are spelled simply (by a single letter) in
both initial and final position (p, t, k, b, d, g, m, n, r). If the position effect
had been due largely to orthographic complexity, it should have disappeared in
this analysis. But it did not. Final consonants still produced more errors than
initial.
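
As a quick arithmetic check on the rates just reported, the short Python sketch below computes the ratio of final to initial consonant error rates at each grade. The variable and function names are introduced here only for illustration; the percentages are those given in the text.

```python
# Error rates (percent) as reported in the text: grade -> (IC, FC).
# The names used here are illustrative, not taken from the study.
error_rates = {2: (8, 16), 3: (5, 10), 4: (2, 6)}


def fc_to_ic_ratio(ic, fc):
    """Return how many times more frequent final errors are than initial."""
    return fc / ic


ratios = {grade: fc_to_ic_ratio(ic, fc) for grade, (ic, fc) in error_rates.items()}

for grade, ratio in sorted(ratios.items()):
    print(f"Grade {grade}: FC/IC = {ratio:.1f}")
```

The ratio is 2.0 at Grades 2 and 3 and 3.0 at Grade 4, which agrees with the statement that final consonant errors are at least twice as frequent as initial ones.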

It is clear that there is indeed a progression of difficulty with the posi-
tion of the consonant segment in the word, the final consonants being more fre-
quently misread than the initial. Similar findings have been reported by other
investigators (Daniels and Diack, 1956; Weber, 1970) who examined error patterns
in the reading of connected text. We found in an earlier study (Shankweiler and
Liberman, 1972) that the initial-final difference cannot be a simple reflection
of the error pattern in speech. There we presented, first for oral repetition
and then for reading, a list of 204 monosyllables chosen to give equal represen-
tation to most of the consonants, consonant clusters, and vowels of English.
The initial-final consonant error pattern was duplicated in reading, but in oral
repetition the consonant errors were about equally distributed between initial
and final position. Moreover, the initial-final error pattern in reading is also
contrary to what would be expected in terms of sequential probabilities. If the
child at the early stages of beginning to read were using the constraints built
into the language, he would make fewer errors at the end than at the beginning
of words, not more.

The contribution of orthographic complexity 

In stressing the difficulty of phonemic segmentation, we do not intend to
imply that no other problems are involved in reading an alphabetic language. For
example, we realize that the mapping in English between spelling and language is
sometimes complex and irregular. Although that undoubtedly contributes to the
difficulties of reading acquisition, we do not believe that the complexity of
the orthography is the principal cause. Indeed, we know that it cannot be the
only cause since many children continue to have problems even when the words are
carefully chosen to include only those which map the sound in a consistent way
and are part of the child's active vocabulary (Savin, 1972). Moreover, reading
problems are known to occur in countries in which the writing system maps the
language more directly than in English (Downing, 1973). In any event, the major
irregularities of English spelling confronting the young child in the simple
words he must read have to do mainly with the vowels.

Though we believe it to be of interest to examine the relation of ortho-
graphic complexity of the vowels to the problems of reading acquisition, and we
are doing so (Fowler et al., in preparation), we suspect that getting the vowel
exactly right may not be of critical importance in reading (though, of course,
it is in spelling). If in the conversion to sound the child gets the phonetic
structure correct except for errors in vowel color, he would not be too wide of
the mark, and many such errors would be rather easily corrected by context or by
information obtained later.



It is recognized that the "irregularities" of English spelling are more lawful
than might appear, as in the spellings of "sign" and "signal," for example,
which reflect morphological structure quite accurately (Chomsky, 1970). How-
ever, it must be said that this lawfulness can be appreciated only by the
skilled reader and probably does not aid the beginner.

THE PHONETIC REPRESENTATION, SHORT-TERM MEMORY, AND READING



Phonetic recoding in reading as a way to tap primary language processes

Though beginning readers must surely recode phonetically if they are to 
cope with new words, we wonder what they do with words (and phrases) they have
read many times. Do they, in those cases, construct a phonetic representation, 
using either of the two strategies we described earlier, or do they, as some be- 
lieve (Bever and Bower, 1966), go directly from print to meaning? 

One can think of at least two reasons why phonetic recoding might occur
even with frequently read materials. A not very interesting reason is that,
having adopted the phonetic strategy to gain advantages in the early stages of 
learning, the reader continues with the habit, though it may have ceased to be 
functional or may even have become, as some might think, a liability. There is 
a more interesting reason, however, and one we are inclined to take more seri- 
ously. It derives from the possibility that working from a phonetic base is 
natural and necessary if the reader (including even one who is highly practiced) 
is to take advantage of the primary language processes that are so deep in his 
experience and, indeed, in his biology. Consider, for example, that the normal 
processes for storing, indexing, and retrieving lexical entries may be carried 
out on a phonetic base. If so, it is hard to see why the reader should develop 
completely new processes, suited for the visual system, and less natural, pre- 
sumably, for the linguistic purposes than the old ones. Or consider what we 
normally do in coping with syntax, an essential step in arriving at the meaning 
of a sentence. Though we do not know much about how we decode syntax, it is 
virtually certain that we are aided significantly by the prosody, which marks 
the syntactic boundaries. What, then, is the cost to our understanding of what 
we read if we do not recover the prosody, using for that purpose the marks of 
punctuation and such subtle cues as skillful writers may know how to provide 
(Bolinger, 1957)?

There are, of course, other natural language processes that the reader can 
exploit only by constructing a phonetic representation. Among them is short- 
term storage, and it is that process we will be concerned with in the remainder 
of this paper. As we pointed out earlier, it is characteristic of language that
the meaning of longer segments (e.g., sentences) transcends the meaning of the
shorter segments (words) from which they are formed. It follows, then, that the
listener and reader must hold the shorter segments in some short-term store if
the meaning of the longer segments is to be extracted from them. Given what we
know about the characteristics of the phonetic representation, we might suppose
that, as Liberman, Mattingly, and Turvey (1972) have suggested, it is uniquely
suited to the short-term storage requirements of language. But apart from what 
we or they might suppose, there is relevant experimental evidence. 

Phonetic representation of visually presented material in short-term memory 

Some of the evidence comes from a class of experiments showing that when
lists of letters or alphabetically written words are presented to be read and re-
membered, the confusions in short-term memory are phonetic rather than optical
(Conrad, 1963, 1964, 1972; Sperling, 1963; Conrad and Hull, 1964; Conrad,
Freeman, and Hull, 1965; Baddeley, 1966, 1968, 1970; Dornic, 1967; Hintzman,
1967; Kintsch and Buschke, 1969; Thomasson, 1970, reported in Conrad, 1972).
From that finding it has been inferred that the stimulus items had been stored
in phonetic rather than visual form. Indeed, the tendency to recode visually
presented items into phonetic form is so strong that, as Conrad (1972) has em-
phasized, subjects consistently do so recode even in experimental situations in
which it is clearly disadvantageous to do so.

A similar kind of experiment (Erickson, Mattingly, and Turvey, 1973) sug-
gests that exactly the same kind of phonetic recoding occurs even when the lin-
guistic stimuli are not presented in a form (alphabetic) that represents the
phonetic structure. In that experiment the investigators used lists of kanji
characters, which are essentially logographic, and Japanese subjects who were
readers of kanji. As in the experiments with alphabetically spelled words,
there was evidence that the stimulus items had been stored in short-term memory
in phonetic rather than visual (or semantic) form.

There is also evidence that even nonlinguistic stimuli may, under some cir-
cumstances, be recoded into phonetic form and so stored in short-term memory.
That evidence comes from work by Conrad (1972) who found that in short-term re-
call of pictures of common objects, confusions were clearly based on the phonet-
ic forms of the names of the objects, rather than on their visual or semantic
characteristics.



Though none of the experiments cited here dealt with natural reading situa-
tions, they are nevertheless relevant to the assumption that even skilled read-
ers might recode phonetically, and that in so doing they might gain an advantage
in short-term memory. It remains to be determined whether and to what extent
readers rely on phonetic recoding for the short-term memory requirements of nor-
mal reading. Less generally, it remains to be determined also whether good and
poor readers are distinguished by greater or lesser tendencies toward phonetic
recoding. In the next section of this paper we will describe our first attempt
to gain evidence on this question.

Phonetic recoding in good and poor beginning readers: an experiment

Given the short-term memory requirements of the reading task and evidence
for the involvement of phonetic coding in short-term storage, we might expect to
find that those beginning readers who are progressing well and those who are do-
ing poorly will be further distinguished by the degree to which they rely on
phonetic recoding. To our knowledge no one has investigated this possibility;
consequently, we set out to do so. Our experiments will be described in detail
elsewhere (Liberman, Shankweiler, Fowler, and Fischer, in preparation); here we
will report briefly on only one experiment which is directly relevant to our
present concerns.

In this experiment, we used a procedure similar to one devised by Conrad
(1972) in which the subject's performance is compared on recall of phonetically
confusable (rhyming) and nonconfusable (nonrhyming) letters. Our expectation
was that phonetically similar items would maximize phonetic confusability and
thus penalize recall in subjects who use the phonetic code in short-term memory.
Sixteen strings of five upper-case letters were presented to the subjects by
projector tachistoscope. Eight of the five-letter strings were composed of
rhyming consonants (drawn from the set: B C D G P T V Z) and eight were composed
of nonrhyming consonants (drawn from the set: H K L Q R S W Y). The two series
of five-letter strings (confusable and nonconfusable) were randomly interleaved.
An exposure time of 3 sec was adopted after preliminary studies had shown that
even adult subjects require exposures in excess of 2 sec in order to report all
five letters reliably. The test was given twice: first with immediate recall,
then with delayed recall. In the first condition, recall was tested by having
subjects print each letter string, in the order given, immediately after presen-
tation. In order to make the task maximally sensitive to the recall strategy,
we then imposed a 15-sec delay between tachistoscopic presentation and the re-
sponse of writing down the string of letters.
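
To make the design of the stimulus series concrete, here is a minimal Python sketch of how such a series could be assembled. It is an illustration under stated assumptions, not the procedure actually used: the text does not say whether a letter could repeat within a string or how the random interleaving was carried out, so sampling without replacement and a simple shuffle are assumed, and the function name and seed are hypothetical.

```python
import random

RHYMING = "BCDGPTVZ"      # phonetically confusable set given in the text
NONRHYMING = "HKLQRSWY"   # nonconfusable set given in the text


def make_series(n_per_type=8, length=5, seed=0):
    """Build a randomly interleaved series of labeled letter strings.

    Assumptions (not stated in the source): letters are drawn without
    replacement within a string, and interleaving is a plain shuffle.
    """
    rng = random.Random(seed)
    series = []
    for label, letters in (("confusable", RHYMING), ("nonconfusable", NONRHYMING)):
        for _ in range(n_per_type):
            series.append((label, "".join(rng.sample(letters, length))))
    rng.shuffle(series)
    return series


series = make_series()
for label, string in series:
    print(f"{string}  {label}")
```

Each run with a given seed yields sixteen five-letter strings, eight from each set, in a mixed order. The same series could then be used for both the immediate and the delayed recall conditions, or regenerated with a different seed.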

As can be seen in Table 1, the subjects included three groups of school
children who differed in level of attainment in reading as estimated by the



TABLE 1: Estimated mean reading grade,* mean age, and IQ+ for second-
grade school children grouped according to reading attainment.

Group        n     Age     IQ       Reading Grade

Superior     17    8.0     113.9    4.9
Marginal     16    8.1     101.7    2.5
Poor         13    8.2     111.6    2.0

*Reading grade equivalent score on reading subtest of the Wide Range
Achievement Test.

+Peabody Picture Vocabulary Test.



word-recognition subtest of the Wide Range Achievement Test (WRAT). All were
nearing completion of the second grade at the time the tests were conducted.
There was no overlap in WRAT scores among the three groups. The first group,
designated as the superior readers, comprised 17 children who were reading well
ahead of their grade placement; they scored a mean grade equivalent of 4.9 on
the WRAT. The second group, whom we call marginal readers, included 16 children
who averaged slightly less than one half year of lag in reading achievement
(grade 2.5). The third group, 13 children whom we call poor readers, obtained a
mean WRAT equivalent of 2.0, indicating nearly a full year of retardation in
reading. The three groups did not differ significantly in mean age. Their in-
telligence level, as measured by the Peabody Picture Vocabulary Test, was closely
matched in the two extreme groups, the superior and poor readers. The differ-
ence in IQ level in the marginal group is apparently of no serious consequence
since, as will be seen below, the performances of the marginal and poor groups
on the experimental tasks were not appreciably different from each other.

In Figure 2, which displays the data in terms of mean errors summed over
all serial positions in the letter strings, the upper plot gives the results for
superior readers, while the middle and lower plots show the results for the
marginal and poor readers, respectively. We see at once that the main differ-
ences are between the superior readers and the other two groups. It was found,
in fact, that the marginal and poor readers did not differ significantly in
their overall performance. For this reason, we need not consider them separately
here and will refer to them collectively instead as the "inferior" readers.

Figure 2: Mean recall errors summed over serial position.

It is immediately apparent that the superior group tends, overall, to make
fewer errors in recall than the inferior readers. More notable, however, are
the differences in the effects of phonetic similarity on the recall performance
of the two reading groups. First, we see that though phonetic similarity caused
some deterioration in immediate recall for all the children, the effect was much
greater for the superior group than for the inferior readers. Second, the dif-
ferential effect of phonetic similarity is even more marked in the delay condi-
tion. For the superior group, the interposition of a delay interval steeply in-
creased errors of recall of the phonetically confusable strings but produced no
effect on the recall of nonconfusable strings. We may suppose that in this
group the phonetic similarity of the confusable strings caused interference with
rehearsal during the delay interval. For the inferior readers, on the other
hand, there is no such interaction; delay depressed their performances on both
confusable and nonconfusable strings by nearly equal amounts.

The differential effect of phonetic similarity on the superior readers is
again apparent in Figure 3, where the data are replotted as a function of serial
position. An examination of the two graphs in the lower half of the figure
shows that, after delay, the superior readers are sharply distinguished from the
inferior groups in their better recall of nonconfusable strings, but are nearly
indistinguishable from the others in their recall of confusable strings. Taken
together, the two lower graphs make manifest the much greater penal effect of
phonetic confusability on the superior readers. The same differentially penal
effect on this group is found also in the case of immediate recall, as seen in
the upper graphs of Figure 3, but there the difference is less striking.

In summary, then, the superior readers are strongly penalized by the pho-
netic similarity of the confusable strings of letters. The penalty is apparent
in immediate recall and more marked in the delay condition. We conclude from
these findings that the superior group is using a phonetic code in short-term
memory. This is not to say, however, that the inferior readers are not recoding
phonetically at all. Phonetic similarity does impair their performance some-
what, though the effect is clearly less than for the superior group. There may
be several interpretations of this difference between the two reading groups in
our study. One possibility is that the inferior readers rely less on phonetic
recoding than the superior group and concurrently use other codes (visual codes,
for example), which are unaffected by phonetic confusability. Another possibil-
ity, suggested by Crowder (personal communication), is that they may simply re-
hearse at a slower rate than the superior readers, thereby giving the confusable



An analysis of variance performed on the data showed all main effects to be
significant at p < .001 (Reading Group: $F_{2,43} = 22.67$; Delay: $F_{1,43} = 29.77$;
Confusability: $F_{1,43} = 73.00$). (The significance of the Reading Group factor
is accounted for by the differences between the superior readers and the other
two groups; the marginal and poor readers do not differ significantly from each
other in recall.) The three-way interaction, Reading Group X Delay X Confus-
ability, is statistically significant at p < .001 ($F_{2,43} = 8.24$). Newman-
Keuls post-hoc means tests reveal that for the superior readers, delay has a
significantly greater effect on recall of confusable sequences than on recall
of nonconfusable sequences. Among the marginal and poor readers, on the other
hand, delay did not differentially affect performance on the two types of
sequences.

items less opportunity to interfere. Whatever interpretation is accepted (and
the answers must await further investigation), we would emphasize that the
failure of the superior readers to maintain their advantage over the inferior
group in short-term memory when the items are phonetically confusable cannot be
accounted for by assuming that the groups differ only with respect to a general
memory capacity.

An auditory analog of our experiment would be one way to clarify the nature 
of the difference in short-term memory between the two groups of readers.
Since phonetic coding, as we said earlier, presumably cannot be avoided when the
linguistic material arrives auditorily, auditory presentation should force the
inferior reader into a phonetic mode. If an important component of his diffi-
culty is that he is deficient in recoding visual symbolic material into phonetic
form, then the phonetic similarity of auditorily presented stimuli should affect
him as much (or as little) as it does the superior readers. While quantitative
differences in memory capacity between the two groups may still show up in the 
general level of recall on the auditory presentation, the interaction of reading 
group and phonetic confusability should be diminished. If, on the other hand, 
the interpretation that implicates differences in rate of rehearsal between the 
groups is correct, the interaction should remain. 

Obviously, many other refinements of the experimental task remain to be 
made. In particular, we hope in the future to use tasks that resemble more 
closely what happens in actual reading. At the very least, we should like to 
repeat the kind of experiment reported here, using words instead of letters.
Only after that could we have a very high degree of confidence in the conclu-
sion that seems to be suggested by the results of the present experiment, namely,
that phonetic recoding is characteristic of skilled reading. 

SUMMARY 

By converting print to speech the beginning reader gains two advantages: 
he can read words he has never seen before, and he can, as he reads, fully ex- 
ploit the primary language processes of which he is already master. If he is 
to realize the first advantage, he must make the conversion analytically, not by 
whole words. That analytic conversion requires, in particular, an explicit 
awareness that speech can be segmented into units of phonemic size. Given what 
we know about the relation of speech sounds to phonetic structure, we can see
why explicit segmentation might be hard to achieve. Our recent research has 
shown that for young children such explicit segmentation is, in fact, difficult 
(more difficult in any case than segmentation into syllables) and that such 
difficulty may be related to success, or the lack of it, in the early stages of
reading.

Among the primary language processes that the child can exploit by conver- 
sion to speech (either analytically or holistically) is the use of a phonetic 
representation to store smaller segments (words, for example) until the meaning 
of larger segments (phrases or sentences) can be extracted. Research on speech 



Since the auditory experiment would, of course, necessitate serial presentation,
an additional visual condition, employing serial presentation, would be re-
quired to achieve comparability.

perception suggests that the phonetic representation may be uniquely suited to
such storage. That the phonetic representation is, in fact, so suited is sug-
gested by the outcome of many experiments on short-term memory. Now we have
evidence from a similar experiment that, among second graders, good readers rely
more on a phonetic representation than poor readers do.


Word Recall in Aphasia*

ABSTRACT

Aphasic and normal subjects listened to lists of ten words in a
probe recall paradigm. Memory function was assessed by estimating
the probabilities of recalling a word from either long-term store or
short-term store. When compared to the normal subjects, nine of the
ten aphasic subjects showed deficient ability to recall a word from
short-term store, and no capability to recall from long-term store.
The memory functions of the remaining aphasic subject were anomalous: 
he showed no ability to recall words from short-term store, but an 
increased ability to recall from long-term store. 

INTRODUCTION 

Two common symptoms observed in patients diagnosed as aphasic are a greatly 
reduced vocabulary and difficulty in recalling strings of digits or words. To a 
psychologist these symptoms might indicate abnormal memory processes. Models of
memory usually consist of the two components, long-term store (LTS) and short-
term store (STS), where LTS is the permanent information store and STS retains
briefly a small number of items (Waugh and Norman, 1965; Atkinson and Shiffrin,
1968). Accordingly, reduced vocabulary could be viewed as a problem in access-
ing and retrieving words from LTS, and shortened auditory retention span might
reflect deficient STS function. 



*A shorter version of this paper was presented at the 87th meeting of the 
Acoustical Society of America, New York, April 1974. 

*Also City University of New York.

Only a few experiments have been conducted to assess memory function in
aphasic patients. This is not surprising since aphasia is defined as a language
disorder, and although normal language communication must depend on normal mem-
ory function, the relationship between the two has rarely been discussed (cf.
Norman, 1972; Aaronson, 1974). Nonetheless, in a study of patients with differ-
ent neurologically based language disorders, Halpern, Darley, and Brown (1973)
found that out of ten language functions, aphasics performed poorest on Adequacy
(word-finding difficulties) and Auditory Retention Span.

Several studies have compared visual versus auditory STS functions with
some aphasic patients. Luria, Sokolov, and Klimkowski (1967) studied two aphas-
ics whose main symptom was the inability to repeat a series of auditorily pre-
sented words (acoustic-mnestic aphasia). They showed that difficulty in recall-
ing a series of three to five words was specific to the auditory modality. They
did not discuss a two-component memory model. Butters, Samuels, Goodglass, and
Brody (1970) tested groups of brain-damaged patients, some of whom were aphasic,
on recall of consonants presented visually or auditorily. A Peterson and
Peterson (1959) paradigm was employed to test immediate and delayed recall for
either single consonants or consonant trigrams. Patients with left-hemisphere,
parietal brain damage (all eight were aphasic) had memory deficits in both vis-
ual and auditory tasks. Patients with left-hemisphere, frontal brain damage
(seven aphasic, one nonaphasic) were thought to have no memory deficits, but
rather an impairment in registration of the consonants. They concluded that
"apparently, aphasic and memory disorders represent separate and independent
processes" (p. 457). It is not clear that their own results fully support this
generalization. In addition, they have not taken into account the fact that in
normal subjects visually presented consonants are often encoded in a phonolog-
ical form in immediate memory (Conrad, 1964; Wickelgren, 1965; Conrad, 1972), an
effect which may have confounded their results (cf. Warrington and Shallice,
1972).

Two other studies have compared memory processes between aphasic and normal
subjects. In one of those (Carson, Carson, and Tikofsky, 1968) only quantita-
tive differences were found between aphasic and normal subjects in several
learning tasks including a verbal serial learning task. In the other study
(Swinney and Taylor, 1971) both quantitative and qualitative differences were
observed in a nonverbal task examining the search process in STS.

A series of memory experiments for both STS and LTS have been conducted by
Warrington and her colleagues using primarily one patient (KF) thought to have
conduction aphasia, and whose main symptom was the inability to repeat a series
of words (Warrington and Shallice, 1969; Warrington, Logue, and Pratt, 1971;
Warrington and Shallice, 1972; Warrington and Weiskrantz, 1973). The main out-
come of the research on KF is his selective impairment of auditory versus visual
STS, and selective impairment of LTS versus STS. In another study of one con-
ductive aphasic, Strub and Gardner (1974) accounted for the repetition difficul-
ties primarily as a result of a linguistic-phonological deficit rather than a
memory dysfunction.

The present study compared both long-term and short-term memory functions
for groups of normal and aphasic subjects. We chose a probe-recall paradigm
based on the work of Waugh and Norman (1965) in which separate functions for LTS
and STS are derived from one set of data. The procedure was to present tape-
recorded word lists, with each list followed by a tone and a word drawn from the
list, the "probe word." The probe word occurred in various positions on the
lists and the subject's task was to recall the word following the probe word.
Previous experiments have reported right-ear superiority in recalling words pre-
sented monaurally (Bakker, 1969; Turvey, Pisoni, and Croog, 1972). Thus, the
word lists were presented monaurally to test for possible ear differences in
this experiment.

METHOD 

Word Lists 

In principle, the word lists were constructed to test the memory of an
aphasic subject, not his difficulties in word usage. It has been reported that
the variables of frequency of occurrence, part of speech, and abstractness
(among others) do affect aphasic patients' use of words (cf. Halpern, 1972).
With this in mind, the words chosen for this experiment were selected from the
Thorndike and Lorge (1944) count of the 1000 most frequent words in English, ex-
cluding proper nouns and function words of three letters or less. Since high
frequency words are easiest for aphasics to use (Schuell, Jenkins, and Landis,
1961) and many of the other variables correlate with word frequency, the words
selected should present minimal difficulty for the subjects. Proper nouns and
short function words were excluded for several reasons, one of which was that
they tended to stand out in the word lists.

Thirty word lists of ten words each were constructed. Words were selected
in a quasi-random way, and only a few words were repeated across lists. Probe
words were chosen so that positions 2, 4, and 6 were probed twice for each ear
and positions 7, 8, and 9 were probed three times for each ear, following the
procedures of Kintsch and Buschke (1969) and Turvey et al. (1972). Word lists
were presented 15 times each to the left and right ears, with ear presentation
alternated randomly.
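
The probe counts above can be checked with a short sketch. The Python fragment below expands the stated design into individual trials and tallies them; the names are illustrative, and the actual assignment of probe positions to particular lists is not given in the text.

```python
# Probes per position for EACH ear, as stated in the text.
probes_per_ear = {2: 2, 4: 2, 6: 2, 7: 3, 8: 3, 9: 3}
ears = ("left", "right")

# One list per probe: expand the design into (ear, probed position) trials.
trials = [
    (ear, position)
    for ear in ears
    for position, count in probes_per_ear.items()
    for _ in range(count)
]

lists_per_ear = sum(probes_per_ear.values())
total_lists = len(trials)
print(lists_per_ear, total_lists)
```

The tally gives 15 lists per ear and 30 lists in all, matching the thirty lists and the 15 presentations to each ear described above.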

The lists were read by the experimenter in a quiet room (IAC 1201) and re-
corded on a Uher 4200 tape recorder. The word lists were read at a rate of
3 sec/word, followed by a brief 1200 Hz tone and the probe word. Each list was
read into only one channel of the tape recorder, the channel assignments alter-
nating randomly. There was a 20 sec response interval between lists. The en-
tire test tape lasted 30 minutes.

Subjects 

Ten aphasic men (aphasics) and another group of eight men matched for age
and education (normals) served as subjects. The aphasics were all patients in
the Speech Pathology and Audiology Services clinic of Northport Veterans Admin-
istration Hospital, New York. The aphasics were judged to have mild to moderate
aphasia as tested on the Short Examination for Aphasia (Schuell, 1957) and the
Porch Index of Communicative Ability (Porch, 1967). Etiologies included both
trauma and cerebral vascular accidents occurring from 6 months to 19 years prior
to testing. In all cases there were symptoms indicating brain damage to the
left hemisphere and in a few cases to both hemispheres. Their ages ranged from
26 to 56 years (mean 47.6 years) and all were right handed. Education levels
achieved included eighth grade (n = 4), high school diploma (n = 5), and college
diploma (n = 1). All patients had audiograms that were normal in both ears for
their ages.






The normals were all veterans who volunteered their time for the experi-
ment. They all appeared to have normal language function and had no known hear-
ing difficulties. They ranged in age from 41 to 65 (mean = 60.6 years) and were
all right handed. The education levels achieved included eighth grade (n = 3),
high school diploma (n = 3), and college diploma (n = 2).

Procedure 

The procedure was the same for both groups. The subjects were verbally in-
structed to report the word on the list that followed the probe word. A few
practice word lists were read until the experimenter was satisfied that the sub-
ject understood the instructions or he was eliminated from the experiment. The
subjects were told that the task was very difficult, but that they were to think
about each word as they heard it and not to go over previous words. The experi-
menter recorded the verbal response of the aphasics whereas the normals wrote 
down the responses themselves. No particular difficulty was encountered in un- 
derstanding the responses spoken by this group of aphasic patients because of 
their moderate impairment. 

The tape-recorded lists were played on a Uher 4200 through Grason-Stadler 
TDH-39 earphones. The playback channels were equated for equal intensity and 
presented to subjects at a comfortable listening level. 

RESULTS 

The results were first tallied by the number of correct responses for each
ear for each subject. Unfortunately, no differences in recall were obtained
from right- versus left-ear presentations for either normals or aphasics. In-
deed, replication of the Turvey et al. (1972) probe recall experiment for nor-
mals now seems in doubt (Turvey, personal communication). Therefore, the data
presented in this paper combine the results from both ears.

Examination of the overall pattern of correct response by probe position
yielded similar functions for all subjects but one, aphasic CH. His data are
reserved for later and the functions showing probability of recall for the
remaining subjects are graphed in Figure 1. For both groups of subjects the
probability of recall is low and constant for the words near the beginning of
the list and increases rapidly for the last three items.

According to the Waugh and Norman (1965) model, the probability of a word
entering secondary memory or long-term store, P(LTS), is constant over all probe
positions if rehearsal of each item is constant (as the subjects were
instructed). On the other hand, the probability of an item being retained in
primary memory or short-term store, P(STS_i), is greatest for the most recently
presented word and decreases monotonically for preceding words until it reaches
zero when, presumably, the limited number of words stored in STS is exceeded.
Assuming that the probabilities of a word being in LTS or STS are stochastically
independent, the probability of recalling a word at probe position i is:

$$P(r_i) = P(\mathrm{LTS}) + P(\mathrm{STS}_i) - P(\mathrm{STS}_i)\,P(\mathrm{LTS}).$$

P(LTS) can be estimated over the constant portion of the recall functions by
taking the mean of the recall probabilities at positions 2, 4, and 6. P(STS_i)
is then calculated from the equation above.
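
The estimation procedure just described can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name, the clipping of estimates to the range 0 to 1, and the recall probabilities in the example are hypothetical and are not taken from the experiment. P(LTS) is the mean recall probability at positions 2, 4, and 6, and P(STS_i) follows by solving the independence equation for the short-term term.

```python
def estimate_store_probabilities(recall, baseline_positions=(2, 4, 6)):
    """Estimate P(LTS) and P(STS_i) from observed recall probabilities.

    recall maps probe position i -> observed probability of correct recall.
    """
    # P(LTS): mean recall over the flat, early portion of the function.
    p_lts = sum(recall[i] for i in baseline_positions) / len(baseline_positions)
    if p_lts >= 1.0:
        raise ValueError("P(LTS) must be below 1 to solve for P(STS_i)")

    p_sts = {}
    for i, p_r in recall.items():
        # Solve P(r_i) = P(LTS) + P(STS_i) - P(STS_i) * P(LTS) for P(STS_i).
        estimate = (p_r - p_lts) / (1.0 - p_lts)
        # Clipping to [0, 1] is an assumption of this sketch, not of the model.
        p_sts[i] = min(1.0, max(0.0, estimate))
    return p_lts, p_sts


# Hypothetical recall probabilities, for illustration only.
recall = {2: 0.10, 4: 0.12, 6: 0.11, 7: 0.30, 8: 0.55, 9: 0.90}
p_lts, p_sts = estimate_store_probabilities(recall)
```

Substituting the estimates back into the equation reproduces the observed recall probability at each position where no clipping was needed, which is a convenient check on the arithmetic.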
Figure 2 displays the estimated probabilities for both groups of subjects.
The most striking feature of Figure 2 is that the probability of recalling a
word from LTS for the aphasics is zero. The P(LTS) for normals (.11) is in the
range of those reported in Waugh and Norman (1965). The estimated probabilities
for STS produced similar recency components for normals and aphasics, although
the P(STS_i) is depressed by about .17 at positions 7, 8, and 9 for the aphasics.

Figure 3 presents the recall function for aphasic subject CH. CH's function
contrasts sharply with those of the normals and aphasics in Figure 1. The
probability of recall at all probe positions is nearly a constant .5 with no
recency effect in the final items. In terms of memory models, CH has a high
probability of recalling an item from long-term store, P(LTS) = .44, and no
evidence of recalling an item from short-term store.

DISCUSSION 

A number of possible disorders of memory function are implied by the results
of this experiment. The majority of the aphasics studied (nine of ten)
retained the same number of words in short-term store as the normal subjects.
However, the aphasics were unable to recall the words correctly from STS as
often as the normals. It should be noted that these statements cannot be
generalized since only mild to moderate aphasics participated in this
experiment. For more severely impaired aphasics the instructions were too
difficult to comprehend and presumably their memory processes might also be
more impaired than those tested here. (Three moderately impaired aphasics were
unable to comprehend the instructions.)

The results for STS agree with those obtained from the visual, serial
search task of Swinney and Taylor (1971). Although some aphasics were unable to
perform in their task, those who did were characterized as using a serial search
process similar to normals, but more slowly and with more errors. These results
are also in accord with the data shown by Carson et al. (1968:98). In the
serial position curves obtained as an average of ten rote serial learning trials,
the recency components for both aphasics and normals extend over the same number
of items. However, the probability of recalling items is depressed for the
aphasics.

In the probe recall experiment, nine out of ten aphasics were incapable of
recalling words from long-term store. The conclusions of Carson et al.
(1968:110) that aphasics learned tasks slowly and "demonstrated limited
retention and transfer of learning in general" might be related to deficient
recall of material from LTS. The question arises, however, as to what extent the
disorder observed is caused by the registration versus the retrieval of words
from LTS. Analysis of the errors in this experiment showed that aphasics
responded occasionally with words in early positions in the word lists (numbers
1 to 5) as well as with words from previous lists. Apparently, some words were
registered in LTS. Clearly, further research directed at the nature of retrieval
of items from LTS should be undertaken.

We can conclude that the majority of aphasics demonstrated memory disorders
characterized by reduced short-term store function and an absence of long-term
store function. On the other hand, one aphasic subject apparently had a
complementary disorder: no STS function and a heightened LTS function. I have
no reason to believe that CH's results are due to any artifacts surrounding his
testing or that he acted on different instructions than those given. (In
informal testing with a graduate student I was unable to find a test strategy
that could duplicate CH's results.)

We might be tempted to treat CH as an anomaly if it were not for the
following studies. The study of two patients by Luria et al. (1967) is not
directly comparable to the present study, but the patients did show strikingly
different patterns of responses in auditory short-term memory tasks. Warrington
et al. (1971) investigated three patients diagnosed as conductive aphasics.
They concluded that these patients had relatively normal LTS function and
severely impaired STS function for the auditory modality only. Strub and Gardner
(1974) confirmed Warrington's results for another conductive aphasic. CH's
results closely match those for the conductive aphasics except for the
surprisingly high probability of recall from LTS. Unfortunately, CH stopped
coming to the clinic and was not available for further testing.

We are thus left with the conclusion that aphasics differ from normals in
both STS and LTS function, and further, that aphasics with different linguistic
(and presumably neurological) deficits may have totally different memory
disorders for auditorily presented words. We hope further research will clarify
these statements and, in particular, incorporate the often observed memory
differences for material presented in the auditory and visual modalities.
Linguistic and Nonlinguistic Stimulus Dimensions Interact in Audition but not
in Vision

James E. Cutting

Haskins Laboratories, New Haven, Conn.



During the processing of speech stimuli, irrelevant variation in
fundamental frequency impedes the phonetic decision-making process,
but irrelevant phonetic variation does not impede pitch decisions.
No analogous asymmetry is found in vision: the processing of a
letter of the alphabet is not impeded by irrelevant variation in type
font, nor is the processing of type font impeded by irrelevant variation
in letter. The differential results can be interpreted in terms
of compulsory (in audition) versus optional (in vision) processing
of linguistic features.

Garner (1974; Garner and Felfoldy, 1970) has employed two-choice speeded
classification tasks to study patterns of interaction between stimulus
dimensions. Often in card sorting tasks stimulus dimensions either mutually
interfere or they do not interfere with one another during processing. Integral
dimensions, such as value and chroma in Munsell color chips, produce interference
when subjects are required to attend to orthogonal stimulus dimensions (Garner,
Hake, and Eriksen, 1956). Separable dimensions, such as size and angle of dot
arrays, do not interfere.

Day and Wood (1972a) and Wood (1973, 1974) found that linguistic and
nonlinguistic dimensions of auditory stimuli were neither integral nor separable.
There was some integrality (as seen in interference and in increased sorting
time) but integrality was asymmetric. Likewise there was some separability (as
seen in lack of interference and no increase in sorting time) but separability
was also asymmetric. The linguistic dimension was place of articulation, [ba]
versus [da] (Day and Wood, 1972a) or [bæ] versus [gæ] (Wood, 1973, 1974), and
the nonlinguistic dimension was fundamental frequency, or pitch, high (140 Hz)
versus low (104 Hz), used in all three studies. Since it is not possible to
mount auditory stimuli on cards, these studies used a discrete reaction time
paradigm.

Four tasks are of interest here. Decisions are made between (1) [ba] versus
[da], for example, with no pitch variation within a task; (2) high pitch versus 
low pitch, with no phonetic variation within a task; (3) [ba] versus [da], with 
pitch varying randomly from trial to trial within a task; and (4) high versus 
low, with the phoneme varying randomly from trial to trial within a task. The 
first two have been called control tasks, since only one dimension varies,
whereas the last two have been called orthogonal tasks, since the nontarget
dimension varies in an uncorrelated fashion. Reaction times are comparable for
tasks 1, 2, and 4, but task 3 shows a marked increase in reaction time. That is,
adding irrelevant phonetic information to the stimuli when making a pitch
judgment has little or no effect on decision time, but adding irrelevant pitch
information to the stimuli when making phonetic judgments increases reaction
time by 8 percent (Day and Wood, 1972a), 12 percent (Wood, 1973), or even 14
percent (Wood, 1974). Such interactions appear to occur in audition only when
one dimension is linguistic and the other is nonlinguistic. Two linguistic
dimensions, such as consonants and vowels, yield integral or mutually
interfering results (Day and Wood, 1972b), and two nonlinguistic dimensions,
such as pitch and intensity, yield an integral pattern as well (Wood, 1973).

The present study was designed to determine if linguistic and nonlinguistic
dimensions in visual stimuli would interact in a similar fashion. Lower-case
letters b and d were used at different thicknesses.

Method



Eight decks of 32 cards each were prepared. Cards were made of cardboard,
2.5 x 3.5 inches with the upper left-hand corner clipped off in a manner similar
to standard computer cards. Mounted on each card 1 inch down from the top and
1.25 inches from either side was the lower case b or d. Each was a 16 point
Letraset Avant Garde Gothic press-on character in either a Medium or Bold font.
The front surface of each card was then sealed in plastic.

Four decks were used in control tasks: Medium font b versus d, Bold font b
versus d, Medium font versus Bold font b, and Medium font versus Bold font d.
Every control deck contained 16 cards each of the two different categories to be
sorted. The four remaining decks were used in orthogonal tasks. Each of these
decks consisted of 8 Medium font b cards, 8 Bold font b cards, 8 Medium font d
cards, and 8 Bold font d cards.

Twenty-four Yale University undergraduate and graduate students sorted each 
of the eight decks four times. Cards were sorted into two piles according to 
task instructions, either by letter or by thickness. The order in which sub- 
jects sorted the cards was determined by a balanced design. 



Results

There were no asymmetries of sorting times, as shown in Table 1. Adding
the irrelevant dimension of thickness increased sorting time by letter by only
1.4 percent, a nonsignificant increment. Adding the irrelevant dimension of
letter increased sorting time according to thickness by only 2.7 percent, also
nonsignificant. These results are similar to those found in a pilot study when
sorting written versions of ba and da, printed in italics and standard type.

TABLE 1: Mean sorting times, in seconds, for the four experimental conditions.

                             Condition
Stimulus Dimension      Control     Orthogonal

Letter                   14.7         14.9
Thickness                14.4         14.8
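
The percentage increments reported here can be approximated from the rounded means in Table 1. The sketch below is illustrative only: the function and variable names are hypothetical, and because the table gives means rounded to one decimal place, the thickness figure comes out near 2.8 percent rather than the reported 2.7 percent, presumably because the reported value was computed from unrounded means.

```python
def percent_increase(control, orthogonal):
    """Percentage increase in sorting time from control to orthogonal task."""
    return 100.0 * (orthogonal - control) / control


# Mean sorting times in seconds (control, orthogonal), as given in Table 1.
sorting_times = {
    "letter": (14.7, 14.9),
    "thickness": (14.4, 14.8),
}

interference = {
    dimension: percent_increase(control, orthogonal)
    for dimension, (control, orthogonal) in sorting_times.items()
}
```

Both increments are small and of similar size, which is the symmetric pattern the text describes, in contrast to the 8 to 14 percent increase reported for phonetic judgments in the auditory studies.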

Discussion

The letters b and d, when pronounced and transcribed, correspond to [bi]
and [di]. These items differ only in vowel from the Day and Wood stimuli [ba]
and [da]. The thicknesses of the letters, in Medium and Bold Avant Garde Gothic
fonts, are roughly analogous to the dimension of pitch in the speech stimuli.
Just as there can be no letters without thickness, there can be no speech
stimuli without an excitation, normally the fundamental frequency (pitch).
Thicknesses and pitches can vary within a wide range without decreasing
identifiability of the letter or phoneme, and such variation does not change the
linguistic message. Thus, the letters b and d in two different fonts would
appear to be an appropriate analog to [ba] and [da] at two different pitches.
Why then are the results not analogous?

While the paradigm in the present study differs from that of Day and Wood
(1972a, 1972b) and Wood (1973, 1974), there is no logical reason for suspecting
this difference to contribute to the different results. Both paradigms yield
differences thought to reflect differential processing. Similarly, the lack of
differences is thought to imply the lack of differential processing difficulty.

A more plausible explanation is that the difference between the visual and
auditory results is caused by the nature of the stimulus dimensions. There is
no question that for English speaking subjects the dimension of place of
articulation, [ba] versus [da], is linguistic and that the dimension of pitch,
high versus low, is nonlinguistic. In the visual stimuli of the present study,
the dimension of thickness is certainly nonlinguistic, but perhaps the dimension
of letter is not strictly linguistic. Perhaps subjects dismiss the
pronunciations of [bi] and [di], and target for nonlinguistic form. Indeed many
subjects in the present study volunteered that they merely looked for the loop
at the lower end of the letter and sorted according to which side of the bar the
loop was located. If indeed the letters were treated as different forms without
reference to their pronunciations, the letter dimension is just as nonlinguistic
as the dimension of thickness.

More broadly, then, a linguistic dimension in a visual pattern can be
treated as language, as in the reading process, or it may be treated merely as
nonlinguistic form. However, linguistic dimensions in an auditory pattern, at
least those distinguishing stop consonants, cannot be dismissed as nonlinguistic
form. It appears that linguistic analysis of speech is in some sense compulsory
and dependent on prior, or parallel (Wood, 1974), analysis of acoustic form.
Laryngeal Muscle Activity, Subglottal Air Pressure, and the Control of Pitch
in Speech

Rene Collier

Haskins Laboratories, New Haven, Conn.



ABSTRACT 

An experiment was performed to assess the degree to which laryngeal
muscle activity and subglottal air pressure affect the rate of
vocal cord vibration in speech. Attention was limited to those
changes in the rate of vocal cord vibration that are associated with
the articulatory implementation of prosodic categories such as
intonation and prominence. Subglottal air pressure was measured directly
through a catheter inserted between the cricoid and thyroid
cartilages. Using hooked-wire electrodes, the electromyographic activity
was recorded in the right and left cricothyroid muscles and in the
sternohyoid, sternothyroid, and thyrohyoid muscles. The data were
collected for one speaker of Dutch. The results of the experiment
show that, in this speaker, (1) cricothyroid muscle activity bears
the most direct relationship to all the major fundamental frequency
(F0) changes: contraction of that muscle raises F0 while its relaxation
has an F0-lowering effect; (2) subglottal air pressure controls
the gradually falling base line of the F0 contour and gives support
to a rapid F0 drop if it occurs on the utterance-final syllable; and
(3) the sternohyoid, sternothyroid, and thyrohyoid muscles have no
systematic effect on F0.

INTRODUCTION 

The research reported in this paper concerns the general issue of pitch
control in speech. More specifically, it is intended to clarify further the
relative importance of laryngeal and respiratory maneuvers in varying the rate
of vocal cord vibration. We also focus attention upon the articulatory
implementation of linguistic categories that are related to intonation and
prominence.

The experiment included simultaneous recordings of subglottal air pressure
and of electromyographic (EMG) activity in those laryngeal muscles that are
usually assumed to participate in the control of fundamental frequency. The
data were obtained for one subject, a native speaker of Dutch. Dutch was chosen
because the intonational structure of the language is rather well understood, in
both its acoustic and perceptual aspects. By including a large variety of pitch
contours in the speech materials of the experiment we hoped to extend the range
of observations beyond the simple dichotomy of "falling" versus "rising" pitch
contours.

EXPERIMENTAL PROCEDURES 

The perceptual experiments reported by Cohen and 't Hart (1967), Collier
(1972), Collier and 't Hart (1972), and 't Hart and Cohen (1973) have resulted
in a fairly complete inventory of the pitch contours that are acceptable in
Dutch, together with a specification of their internal structure and of the
degree of perceptual resemblance among them. Figure 1 presents a number of those
contours in a stylized form. Each of them can be considered as a particular
sequence of transitions between a low and a high pitch level. Those two
reference levels can be approximated by parallel lines of gradually downward
drifting pitch, the so-called high and low "declination line." The declination
line effectively functions as a link between successive pitch movements, and is
audible as such on all those syllables that do not carry a major change in pitch.
The rate of declination may vary and individual pitch movements may overshoot or
undershoot the declination line.

During the experiment all the contours of Figure 1 were spoken by the
author, who is a native speaker of Dutch. He read lists of randomized
utterances in which the word content was identical and the pitch contours were
varied. Each contour type was repeated between 20 and 30 times. The utterances
chosen were: "Heleen wil die kleren meenemen" [he.le.n wɪl di kle:rə me.ne.mə]
(Helen wants to take those clothes along); and "Heleen" [he.le.n] (Helen). The
longer utterance was chosen according to the following criteria: meaningful
Dutch, no open vowels (because jaw opening may influence the pattern of
sternohyoid activity), a maximum of voiced segments (in order to obtain a
continuous fundamental frequency curve), one voiceless segment (to facilitate
the location of a line-up point for averaging the repetitions of each contour
type), and at least three potentially prominent syllables.

Our intention was to sample EMG data for three intrinsic and three extrinsic
laryngeal muscles and to record subglottal air pressure at the same time.
The muscles chosen were: cricothyroid (CT), lateral cricoarytenoid (LCA),
vocalis (VOC), sternohyoid (SH), sternothyroid (ST), and thyrohyoid (TH).
Hooked-wire electrodes were inserted percutaneously into each of these muscles,
following the techniques described by Hirose (1971). Subglottal air pressure
(Ps) was measured directly: a plastic tube (0.035 inch inside diameter, 1 1/2
inches long) was placed around an 18-gauge steel needle and inserted through the
cricothyroid membrane. When the needle was withdrawn the plastic tube was left
in place. It was then coupled to a second tube (3/16 inch inside diameter,
3 inches long) and connected to a pressure transducer (Setra Systems, Model
236L).

Figure 1: Stylized shape of the pitch contours used in the experiment

The EMG and pressure signals were directed to differential amplifiers, then
to distribution amplifiers. The physiological signals, together with the audio
signal and timing pulses, were recorded on a 14-channel instrumentation recorder
(Consolidated Electrodynamics VR-3300). The visual editing of the raw data and
their computer processing were performed on the Haskins Laboratories' EMG
data-processing system, following the procedures described by Port (1971) and
Kewley-Port (1973, 1974). The Ps and EMG data were processed simultaneously.
Before averaging, the physiological signals were optionally smoothed using an
integration time constant of 75 msec. In order to increase the accuracy of
comparative timing measurements, the data were also processed with a 20 msec
time constant. F0 was measured in selected tokens of each contour type using a
computer-implemented adaptive autocorrelation method designed by Lukatela (1973).
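
The two processing steps described here, smoothing with an integration time constant and averaging repetitions around a line-up point, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Haskins system's procedure: the first-order leaky integrator, the full-wave rectification, and all function names and parameters are hypothetical choices made for the example.

```python
import math


def smooth(signal, sample_interval_ms, time_constant_ms):
    """Rectify a sampled signal and pass it through a leaky integrator.

    The integrator is a first-order low-pass filter; its step response
    reaches 1 - 1/e of the final value after one time constant.
    """
    alpha = 1.0 - math.exp(-sample_interval_ms / time_constant_ms)
    smoothed, level = [], 0.0
    for sample in signal:
        level += alpha * (abs(sample) - level)
        smoothed.append(level)
    return smoothed


def average_aligned(tokens, lineup_indices, before, after):
    """Average tokens sample by sample in a window around each line-up index."""
    width = before + after
    totals = [0.0] * width
    for token, lineup in zip(tokens, lineup_indices):
        start = lineup - before
        if start < 0 or lineup + after > len(token):
            raise ValueError("window extends beyond the token")
        for offset, value in enumerate(token[start:lineup + after]):
            totals[offset] += value
    return [total / len(tokens) for total in totals]


# Step response with 1 msec samples and a 75 msec time constant.
step_response = smooth([1.0] * 300, sample_interval_ms=1.0, time_constant_ms=75.0)

# Two hypothetical tokens whose line-up points fall at different samples.
averaged = average_aligned([[0, 1, 2, 3, 4], [9, 0, 1, 2, 3]], [1, 2], before=1, after=2)
```

A shorter time constant follows rapid changes more closely but leaves more ripple, which is consistent with the use of a 20 msec constant for timing comparisons and a 75 msec constant for the displayed curves.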

RESULTS 

Inspection of the processed data revealed that the electrode intended for
the VOC apparently had not reached that muscle but had instead been inserted
into the CT: the recorded muscle did not show the expected contraction
for swallowing, coughing, or glottal stop production but was active for singing
ascending pitch scales. Figure 2 shows the great similarity of the two CT
channels. In presenting the data, only results obtained for the right-side CT
will be considered. Figure 2 also shows that the SH and ST muscles have a
rather similar pattern of activity. This similarity was also observed by
Atkinson (1973). Therefore only the data on the SH will be presented. It is
clear from Figure 3 that the TH muscle is not obviously related to F0 changes:
its pattern of activity varies little as a function of differences in the F0
contours. TH will therefore not be considered in further presentations of the
data. Since the insertion into the LCA deteriorated during the experiment, the
data on this muscle could not be used.

All the relevant data of the experiment are grouped in the Appendix. They
are displayed in the following order (from top to bottom in each illustration):
fundamental frequency, cricothyroid, sternohyoid, and subglottal pressure. In
each graph the thick line represents the physiological signal averaged over 20
to 30 repetitions of the same contour type. The thin line corresponds to the
signal of the single token whose correlation with the averaged signal was found
to be high on all channels (correlation coefficients ranging from r = +.85 to
+.98). The F0 curve at the top presents the pitch contour of that selected
token. The longer utterances have been lined up with respect to the resumption
of voicing after [k] in "Heleen wil die kleren meenemen;" the shorter utterances
have been lined up with respect to the beginning of the second [e] in "Heleen."
In all the illustrations the physiological data have been smoothed with an
integration time constant of 75 msec.

In this section on Results we will examine the relationship between the
various types of F0 change and each of the physiological variables. We will
point out particular EMG and Ps features that co-occur with the major F0
changes, but will save the interpretation of the relative importance of these
parameters for the Discussion section that follows.
Figure 2: The EMG activity in five laryngeal muscles and the subglottal air
pressure variation, averaged over 26 repetitions of Contour 4.

Figure 3: Thyrohyoid muscle activity in four different contours.



Pitch Rises 



All pitch rises appear to be preceded by increased CT activity. The time
delay between a peak in CT activity and the corresponding F0 maximum, as
measured in 30 individual tokens, averages 94 msec (s = 32). The positive
correlation between increasing F0 and increasing CT activity appears to be good
also when one examines the rate of increase and the magnitude of the F0 rise.
Thus, if F0 rises abruptly, CT activity increases equally suddenly. Compare,
for example, the first (sudden) and second (smooth) F0 rises in Contour 12.
Also, if two F0 rises differ in magnitude, then the corresponding peaks in CT
activity tend to differ accordingly. For example, compare the slightly
different peak values of F0 in the three rises of Contour 2, or the largely
different F0 peak values in Contours 10, 11, or 12. Contour 9, however, is
different in this respect. The peak values of the second and third F0 rises are
almost identical and yet the corresponding CT peaks are different. Apparently,
in this instance, where the third F0 rise occurs on the utterance-final
syllable, the CT contracts more strongly to overcome the pitch-lowering effect
of the simultaneous, rapid Ps drop.
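
The delay measurement reported in this paragraph can be sketched as a peak-to-peak lag computed per token and then summarized. The sketch is illustrative only: the function name, the use of the global maximum as the peak, the sample standard deviation for s, and the example values are assumptions and do not describe how the original measurements were made.

```python
from statistics import mean, stdev


def peak_lag_ms(emg, f0, sample_interval_ms):
    """Delay from the EMG peak to the F0 peak, in msec (positive: EMG leads)."""
    emg_peak = max(range(len(emg)), key=emg.__getitem__)
    f0_peak = max(range(len(f0)), key=f0.__getitem__)
    return (f0_peak - emg_peak) * sample_interval_ms


# Hypothetical smoothed CT activity and F0 values sampled every 10 msec.
ct_token = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
f0_token = [100.0, 100.0, 105.0, 110.0, 120.0, 110.0]
lag = peak_lag_ms(ct_token, f0_token, sample_interval_ms=10.0)

# Summary over several tokens (hypothetical lags, in msec).
lags = [lag, 90.0, 110.0, 60.0]
summary = (mean(lags), stdev(lags))
```

Taking the global maximum is adequate only when each analysis window contains a single rise; for contours with several rises the window would have to be restricted to one rise at a time.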

Usually, when the CT is contracting before a rise in F0, the SH appears to
show strongly reduced or suppressed activity. Examples of this relationship can
be seen in Contours 2, 4, and 6. It is also apparent from the comparison of the
second halves of Contours 9 and 10. However, there are cases where the SH does
have a major peak during the CT contraction. This is particularly true of the
SH peak preceding the line-up point in Contours 3, 5, and 13.

Most of the sudden F0 rises are roughly correlated with increases in Ps,
but in many of those instances the peaks in F0 and Ps are not synchronous: the
pressure peak precedes the F0 peak by 80 msec in Contour 14; by 75 and 65 msec
in Contour 6; and by 55, 40, and 35 msec in Contour 2. In other instances Ps
and F0 peaks are indeed synchronous, as in Contours 16, 18, and 19, all of which
are short, bisyllabic utterances. In Contour 9 the first Ps peak leads the F0
peak by some 50 msec, while the second Ps peak is synchronous with F0. Ps does
not increase if an F0 peak occurs late in the utterance-final syllable, as in
Contours 9 and 19. Also, gradual increases in F0 are not invariably reflected
in a smoothly rising Ps: in Contour 12 the correspondence is fairly good, but
not in Contour 10. The increases in Ps, associated with rising F0, range from
0.5 to 2.0 cm aq while the F0 rises vary between 20 and 75 Hz.

Pitch Falls 

Marked drops in F0 appear to be preceded by relaxation of the CT in all
utterances: e.g., in Contours 1, 2, 4, and 6. Apparently differences in the
rate of F0 decrease are also reflected in the rate of CT relaxation. In
Contour 1 the first drop in F0 is more rapid than the second, and so is the
pattern of decreasing CT activity. In Contour 9 the first F0 fall is more
gradual than the second; the first F0 fall in Contour 11 is less steep than the
first in Contour 12. All these differences are reflected in the corresponding
patterns of CT relaxation.

The onset of an F0 fall coincides with the beginning of increased SH activity,
so that the contraction of the SH reaches a peak by the time the F0 fall is
almost half completed. These temporal relationships are illustrated in Contours
2, 4, 9, and 16.
Most F0 falls are accompanied by a decrease in Ps. This can be observed
whether or not the F0 fall occurs near the end of the utterance. Thus, if an F0
fall occurs on a syllable that is not utterance-final (case A, for example in
Contour 3), there is one decrease in Ps associated with this F0 fall and a
further one that is related to the end of phonation. If the final fall of an F0
contour does occur on the utterance-final syllable (case B, for example in
Contour 16), there is only one drop in Ps that marks both the F0 fall and the
end of phonation. In case A neither of the Ps drops exceeds 2.5 cm aq. In case
B the Ps decrease is at least 5 cm aq. In case A the last Ps decrease occurs
30-100 msec before the end of phonation, while in case B the only Ps drop begins
200-300 msec before this point.

High "Declination"

Stretches of high, nearly constant F0 can be observed in Contours 3, 5, 13,
and 14. During these portions the CT always shows continuing activity, while Ps
is either constant or slowly falling. As long as F0 is high, the SH shows
partially reduced or suppressed activity.

Low "Declination" 

Stretches of low, declining F0 can be observed in Contours 3, 4, and 6, and
especially in Contours 7 and 8. In all these utterance portions the CT is
completely passive while the SH shows successive peaks of varying height. Ps is
gradually falling and does so at a rate that matches the slowly falling F0
rather well. In these cases a 5 Hz decrease in F0 corresponds to a drop of 1 cm
aq in Ps.

DISCUSSION 

In the previous section we presented the experimental data, looked at the
various types of F0 change, and tried to indicate which physiological variation
was associated systematically with the F0 change. In the present section we
approach the data from the opposite angle and examine each physiological
parameter with respect to its effect on F0 variation. The second part of the
section deals with the relation between linguistic categories such as "breath
group" and "prominence," and the articulatory mechanisms that implement them.

The Function of the Physiological Parameters 

The cricothyroid muscle. The pitch-raising effect of CT contraction that
we observe in our data has long been known. Ever since the early EMG studies on
humans by Katsuki (1950) and Faaborg-Andersen (1957) it has been repeatedly
confirmed that of all the intrinsic laryngeal muscles, the CT shows the most
direct relationship to increasing F0, both in singing and in speech. As to the
F0-lowering effect of CT relaxation, Lieberman (1970) observes: "There is no a
priori reason to assume that all phonetic features must be implemented by
tensing a particular muscle.... It is therefore possible that abrupt falls in F0
could be implemented by relaxing muscles that in their tensed state maintain
higher F0" (p. 199). Simada and Hirose (1970) and Sawashima, Kakita, and Hiki
(1973) found that a steep decrease in CT activity preceded the F0 drop
associated with the accent kernel in Japanese. Atkinson (1973), too, has
presented many instances where F0 falls correlate with the decreasing activity
of CT (and VOC and LCA).
Our data also indicate that whenever F0 is high and level, the CT shows a
pattern of continued activity. The amount of this activity may or may not be
constant, so that prolonged CT contraction invariably correlates well with "high"
F0, but not always with "declining" F0. Similarly, Sawashima et al. (1973)
briefly mention the presence of high and gradually decreasing CT activity during
a so-called plateau in two of their F0 contours.

In our data, there is only one kind of F0 change that bears no relationship
to the CT: when F0 is generally low and falling very gradually (at a rate of
some 15 Hz/sec), the CT is not active and cannot be held responsible for the
smoothly falling F0. This can be seen in Contours 7 and 8. Apart from this
exception, the CT always shows a very straightforward correlation with F0; when
one adjusts for the timing difference between the physiological and the acoustic
event, the CT matches F0 changes with respect to their direction (rising or
falling), their magnitude (small or large), and their rate (gradual or sudden).

The sternohyoid muscle. Most researchers who have looked into the function
of the SH in speech have argued that this muscle can participate in both
segmental articulation and F0 control. As far as segmental articulation is
concerned, Ohala and Hirose (1970) and Simada and Hirose (1970) mention that SH
contraction is associated with jaw opening, tongue lowering, and tongue
retraction, all of which require lowering or fixation of the hyoid bone. Ohala
(1970), Gårding, Fujimura, and Hirose (1970), and Atkinson (1973) also observe
peaks in SH activity immediately before the onset of phonation and assume that
the SH helps in preparing the larynx for the "speech mode." Atkinson further
finds SH activity during voiceless consonant closure and relates this to the
resumption of phonation after the consonant. Our test utterances do not contain
open vowels, but most of the SH peaks can indeed be traced back to tongue
retraction for the release of [l,n,d] or to tongue lowering for the release of
[k]. Our data also show a peak in SH activity before the onset of phonation.

With respect to the participation of the SH in F0 control, a number of
researchers have pointed out that the SH is active during the transition from
high to low F0, as well as during low, level F0 (Ohala, 1970; Atkinson, 1973;
Sawashima et al., 1973). Ohala (1972) and Kakita and Hiki (1974) have attempted
to account mechanically for the effect of SH contraction on F0. In our own
opinion, however, there remain reasons why the SH cannot be considered the
primary effector of F0 lowering. The first reason is that the lack of timing
difference between the two variables makes a direct causal relationship
unlikely. Since it takes time for a muscle to contract and become effective, one
should expect SH activity to start well before the onset of the F0 drop. In our
data (and in those of Atkinson, 1973, and Sawashima et al., 1973) SH contraction
coincides with the beginning of the F0 fall. The second reason is that SH
activity is the same for abrupt changes from high to low F0 as for steady, low
F0. The third reason is the imperfect reciprocity between the patterns of CT and
SH activity. If the two muscles were antagonists, SH would relax whenever CT
contracts. However, our data show many exceptions to this tendency (for example,
in Contours 3, 5, and 13). The fourth reason is that when one imitates a pitch
contour by humming it (thus eliminating all segmental effects), there is almost
no activity in the SH throughout the contour, not even for F0 falls as large as
70 Hz.

Subglottal air pressure. Studies such as those discussed in Ohala (1970)
indicate that variations in Ps can have an effect on the rate of vocal cord
vibration. Our own data show that in certain pitch contour types the overall
high and relatively steady Ps level is modulated by fluctuations that roughly
correspond to the rises and falls in F0. Ladefoged (1962) presented EMG data
that show increased activity in the internal intercostal muscles immediately
before accented syllables. Thus the momentary increases in Ps, associated with
rising-falling F0 on accented syllables, may well be the result of active,
respiratory muscle control. Yet one must not overlook the possibility that the
observed changes in Ps also reflect the variations in glottal resistance.
Indeed, some of the intrinsic laryngeal muscles, especially the CT, but also the
VOC and LCA, have a pitch-dependent pattern of activity. Thus it is conceivable
that the synergetic, F0-raising contraction of these muscles results in a
gradually stronger resistance to the flow of air from the lungs, so that Ps
passively increases in the vicinity of an accented syllable carrying an F0 rise.
The comparisons in Figure 4 show that the level of CT contraction may indeed
explain to some extent the modulations in the Ps curve.

Whatever the origin of the Ps fluctuations may be, it appears that in our
data they cannot be considered the prime cause of the major F0 changes. For one
thing, they appear to be too small to account for the full extent of the F0
changes, unless one accepts as normal a ΔF0/ΔPs ratio as high as 40:1. For
another, the timing relationship between Ps and F0 is highly variable and the
expected synchrony of the peaks of the two variables is lacking in most of the
cases. Finally, our data contain instances where Ps does not reflect the F0
changes at all. This is particularly true of the F0 rise in the second half
of Contours 10 and 15, and of the F0 falls in Contour 14.

Only two types of F0 change in our data have a fairly good correlation with
Ps. Subglottal pressure apparently controls the course of F0 during those
portions of the contour where no major rises or falls occur. In such cases the
ΔF0/ΔPs ratio is approximately 5/1. This interpretation corresponds to that of
Atkinson (1973), who states: "The steady or slightly falling pressure...seems
capable of controlling to a large extent any steady or slightly falling F0
contour" (p. 117). In our description of F0 falls we noted that Ps decreases
more strongly and earlier in the utterance-final syllables that carry the
terminal F0 fall of the contour than in those that do not. Thus, considering its
magnitude and timing, rapidly falling Ps may indeed control rapidly falling F0
in utterance-final syllables. Here the ΔF0/ΔPs ratio is 12/1. The difference
between these two cases is that in the former, the CT is completely passive and
cannot influence F0, while in the latter, the F0 fall can be related not only
to falling Ps but also to decreasing CT activity.
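
The arithmetic behind this argument can be made explicit. The following Python
sketch is only an illustration: the function names are hypothetical, the text
does not state the units of the ratios (none are assumed here), and the value
of 70 is borrowed from the humming observation above merely as an example of a
large F0 change.

```python
def f0_ps_ratio(delta_f0, delta_ps):
    """Return the ratio of an F0 change to the accompanying Ps change."""
    if delta_ps == 0:
        raise ValueError("Ps did not change; the ratio is undefined")
    return delta_f0 / delta_ps


def ps_change_needed(delta_f0, ratio):
    """Ps change that would be required to produce delta_f0 at a given ratio."""
    return delta_f0 / ratio


# Ratios quoted in the text for the two cases where Ps tracks F0.
GRADUAL_RATIO = 5.0    # steady or slightly falling F0
TERMINAL_RATIO = 12.0  # rapid utterance-final fall

# Hypothetical example: how large a Ps change would be needed to produce
# an F0 change of 70 on its own at each of the two ratios?
for label, ratio in [("gradual", GRADUAL_RATIO), ("terminal", TERMINAL_RATIO)]:
    needed = ps_change_needed(70.0, ratio)
    print(f"{label}: ratio {ratio:g}/1 -> Ps change of {needed:.1f} needed")
```

If the Ps change actually observed is much smaller than the value computed this
way, Ps alone cannot account for the F0 change, which is the form of the
argument made above.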

The Articulatory Implementation of Prosodic Categories 

The contours presented in Figure 1a are variants of the same intonation
pattern (Cohen and 't Hart, 1967; 't Hart and Cohen, 1973). This pattern
corresponds to the "unmarked breath group" [-BG] in the feature system of
Lieberman (1967, 1970) and Lieberman, Sawashima, Harris, and Gay (1970). The
contours in Figure 1b all differ from those in Figure 1a, and some, if not all,
also differ from each other (Collier, 1972; Collier and 't Hart, 1972). In
Figure 1b only Contours 9 and 10 (and its variant Contour 13) correspond to
Lieberman's "marked breath group" [+BG]. In the feature system of Vanderslice
and Ladefoged (1972) Contour 9 would be [+cadence, +endglide], while Contour 10
(or 13) would be [-cadence, +endglide]. Contours 11, 12, 14, and 15 cannot be
described in terms of either feature system. Let us therefore limit our
attention just to those contours that seem common to English and Dutch.

Figure 4: Cricothyroid muscle activity and subglottal air pressure variation
in eight different contours.

Unmarked breath groups appear in Contours 1 to 8. They all show laryngeal
activity associated with the F0 rises and falls, irrespective of the location
of the F0 changes in the contour. Subglottal pressure also shows momentary
rises and falls associated with the F0 variations. Contours 9 and 10 (and 13)
are examples of marked breath groups. They too show laryngeal activity all
through the contour. Subglottal pressure also reflects the F0 variation except
for the F0 rise on the utterance-final syllable, where Ps is falling.

If we look upon an F0 contour as the product of both ± breath group and ±
prominence specifications, there are at least two ways of sorting out the
respective effects of these two prosodic categories. One way is to consider the
breath group as the overall configuration of F0 changes whose actual
distribution over the contour is a function of the prominence specification of
the syllables. The relationship between [BG] and [PROM] is then: all F0 changes
are the implementation of [BG], while some of them simultaneously implement
[+PROM] by occurring on the prominent syllables. (Whether an F0 change is a good
cue for prominence depends mainly on its timing with respect to the syllable
boundaries; see van Katwijk, 1974.) An alternative way is to consider the
unmarked breath group to be the gradually declining base line of the overall F0
contour, which further consists of major F0 changes that implement [+PROM] and
are superimposed on the breath group. The marked breath group is the same F0
base line as the unmarked, but a rise in F0 is added near the end. This rise,
unlike some other major F0 changes that precede it, does not implement [+PROM].
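
The second interpretation lends itself to a simple additive sketch. The Python
below is a hedged illustration, not a model proposed by any of the authors
cited: the function names, the starting F0 of 120 Hz, the rectangular shape of
each excursion, and the size and timing of the excursions are all assumptions.
Only the declination rate of about 15 Hz/sec is taken from the data described
earlier.

```python
def baseline(t, start_hz=120.0, slope_hz_per_sec=-15.0):
    """Gradually declining base line (the unmarked breath group in this view)."""
    return start_hz + slope_hz_per_sec * t


def excursion(t, onset, offset, size_hz):
    """A major F0 change superimposed on the base line between onset and offset."""
    return size_hz if onset <= t < offset else 0.0


def f0_contour(t, prominences, terminal_rise=None):
    """Base line plus [+PROM] excursions, plus an optional terminal rise.

    prominences   -- list of (onset, offset, size_hz) tuples (hypothetical values)
    terminal_rise -- (onset, size_hz) for a marked breath group, or None
    """
    f0 = baseline(t)
    for onset, offset, size_hz in prominences:
        f0 += excursion(t, onset, offset, size_hz)
    if terminal_rise is not None and t >= terminal_rise[0]:
        f0 += terminal_rise[1]
    return f0


prominences = [(0.2, 0.5, 40.0)]       # one prominent syllable (assumed values)
times = [i / 10 for i in range(11)]    # 0.0 to 1.0 sec in 0.1 sec steps
unmarked = [f0_contour(t, prominences) for t in times]
marked = [f0_contour(t, prominences, terminal_rise=(0.8, 30.0)) for t in times]
```

In this sketch the two breath groups share the same base line and the same
prominence excursion; they differ only in the terminal rise, which is added
without reference to prominence.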

Evidently the major difference between these interpretations is that in the 
first view breath group is completely synonymous with Fq contour, while in the 
second view breath group in fact equals declination line, with or without a 
terminal rise. The second interpretation is a possible paraphrase of the views 
expressed by Lieberman (1970) and Lieberman et al. (1970), while the first
reflects our own position. It is clear that these differences of opinion are
situated on the more abstract, linguistic level (where mental constructs such as 
breath group and prominence belong). Let us therefore examine separately what 
agreement there may be in specifying the articulatory correlates of these pro- 
sodic categories. 

Lieberman (1970) and Lieberman et al. (1970) assume that the articulatory
correlate of [-BG] is the pattern of respiratory muscle control that can be
used to generate a relatively steady subglottal air pressure contour; [+BG]
involves the participation of laryngeal muscles which produce the contour-final
F0 rise; [+PROM] syllables are characterized by momentary variations in both
subglottal pressure and laryngeal tension. Lieberman further assumes that
increased Ps on [+PROM] syllables is the "archetypal," primary correlate of this
feature, while increased CT activity is a secondary characteristic. Our own
view is that Ps controls the lower declination line only and that all major F0
changes (whichever prosodic category they implement) are under laryngeal
control. Thus we do not think that changing Ps can be the primary articulatory
correlate of major F0 variations, since the Ps variations in our data that are
associated with accentual or intonational features are too small to account for
the full extent of the F0 change.

Lieberman (1970) appears to associate [+PROM] with rising-falling F0.
Whenever linguistic stress is manifested by unidirectional F0 changes, i.e., by
a simple rise or a simple fall, Lieberman considers these changes as the
acoustic correlates of [+ACCENT UP], [+ACCENT DOWN], not of [+PROM]. He says
that their articulatory implementation is under strictly laryngeal control
(i.e., without accompanying Ps variation). By contrast, our own interpretation
is that [+PROM] can be realized as rising-falling, rising, or falling F0 (van
Katwijk, 1974). This assumption is necessary to explain how pairs of contours
such as 1 and 2, 3 and 4, 5 and 6 are considered as free variants by native
speakers of Dutch ('t Hart and Cohen, 1973): both variants implement the same
type of breath group with the same number of linguistic stresses on the same
syllables. Thus it would not be plausible to consider the same syllable as
[+PROM] in one variant and as [+ACCENT UP] or [+ACCENT DOWN] in the other. Apart
from this difference in linguistic interpretation, we share the view with
Lieberman that the articulatory correlate of these unidirectional F0 changes is
to be found in laryngeal maneuvers.

CONCLUSION 

Our experiment suggests that the gradually falling baseline of an F0 contour
is controlled by the slowly decreasing subglottal air pressure, while the major
deviations from this baseline (i.e., the rises and falls in F0) are caused by
the action of the cricothyroid muscle. The increasing activity of this muscle
raises F0, its continued contraction maintains high F0, and its relaxation
lowers F0. Thus, the major differences between any two F0 contours appear to be
related systematically to differences in the activity of this single muscle.
Other laryngeal or respiratory muscles may well assist in producing a change in
F0, but their effect is secondary.


A Cinefluorographic Study of Vowel Production*
Thomas Gay+

Haskins Laboratories, New Haven, Conn.



      The purpose of this experiment was to study the effects of
changes in both phonetic context and speaking rate on the movements
toward and attainment of target positions for the vowels /i/, /a/,
and /u/. Two subjects read lists of nonsense words containing these
vowels in vowel-consonant-vowel (VCV) combination with the consonants
/p/, /t/, and /k/, at both slow and fast speaking rates. Lateral-view
X-ray films were recorded along with the acoustic signal. Results
showed that during slow speech the target positions of both /i/
and /u/ remain highly stable across changes in both the preceding and
following consonant and vowel. The production of /a/, although not
subject to right-to-left effects beyond the following consonant, is
sensitive to changes in the consonant, as well as in the vowel
preceding the consonant. These coarticulation effects, however, are not
reflected as such in the acoustical measurements. The production of
all three vowels during fast speech is characterized by articulatory
undershoot and an upward shift in the frequencies of both the first
and second formants. These results are discussed in terms of a
target-based description of vowels.

Coarticulation has been the subject of considerable interest in recent
physiological speech research, yet one that is still little understood.
Although the variability in the production of a phone at all levels of the
peripheral production process is well documented, there is little data on the
exact nature and extent of most coarticulatory phenomena.

One good example is vowel production. MacNeilage and DeClerk (1969) and
Harris (1971) have shown that different motor command strategies can be used for
a vowel depending upon the phonetic context in which it is placed. However, it
is not clear whether these different strategies are reflected in differences in
the target positions of the vowel. Indeed, it might be argued that
coarticulation at the motor command level simply reflects a strategy to attain a
quasi-invariant articulatory target position (MacNeilage, 1970). Unfortunately,
however, the available data that bear on this point are somewhat contradictory.
For example, the physiological data of Houde (1967), MacNeilage and DeClerk
(1969), and Gay, Ushijima, Hirose, and Cooper (1974) suggest that vowel
stability is more the rule than the exception, while the X-ray data of Kuehn
(1973) suggest a good deal of positional variability for the vowel target
(specifically /a/). Target variability is also evident for faster speech (Gay
et al., 1974; Kuehn, 1973) and destressed speech (Lindblom, 1963; Kent and
Netsell, 1971).

*Also in Journal of Phonetics (1974) 2, 255-266.

+Also University of Connecticut Health Center, Farmington.

The purpose of the experiment reported here was to examine more closely a
number of aspects of vowel production. The most important of these concerns the
nature of a vowel target and whether it can be defined in terms of a
three-dimensional articulatory coordinate system (MacNeilage, 1970). We used
cinefluorography to study the effects of changes in both phonetic context and
rate of speech on the movements of the tongue and jaw during the production of
selected vowels. This was designed to provide a descriptive account of the
movements toward and attainment of vowel target positions under the constraints
of a variety of linguistic demands known to be sources of articulatory
variability. In addition, we used acoustical analysis to determine whether any
variability evident at the articulatory level is reflected in the formant
structure of the vowel.

METHOD 

Subjects and Speech Material 

Subjects were two adult males, FSC and TG, both native speakers of American
English. The speech material consisted of the consonants /p,t,k/ and the vowels
/i,a,u/ in a trisyllable nonsense word of the form /kV1CV2pə/, where V1 and V2
were all possible combinations of /i,a,u/ and C was either /p/, /t/, or /k/.
The 27 utterance types were randomly ordered into a master list. Each
utterance, preceded by the carrier phrase, "It's a....," was produced at two
speaking rates: slow (or normal) and fast. Each rate was based on the subject's
own appraisal of comfortable slow and fast rates. A brief practice session
preceded the filming session. The subjects were also instructed to say the first
two syllables of the utterance with equal stress, and the final syllable
unstressed.
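
The count of 27 utterance types follows from crossing three first vowels, three
consonants, and three second vowels. The Python sketch below enumerates and
shuffles them. It is an illustration only: the function names and the random
seed are hypothetical, and the frame with an initial /k/ and a final /pə/ is an
assumption based on the later references to the initial /k/ release, the final
/p/ closure, and the unstressed final syllable.

```python
import itertools
import random

VOWELS = ["i", "a", "u"]
CONSONANTS = ["p", "t", "k"]


def utterance_types():
    """All V1-C-V2 combinations embedded in the (assumed) frame k _ _ _ pə."""
    return [
        f"k{v1}{c}{v2}pə"
        for v1, c, v2 in itertools.product(VOWELS, CONSONANTS, VOWELS)
    ]


def master_list(seed=0):
    """Randomly ordered list; the seed is a hypothetical choice for repeatability."""
    items = utterance_types()
    random.Random(seed).shuffle(items)
    return items


types = utterance_types()
print(len(types))          # number of distinct utterance types
print(master_list()[:5])   # first few items of one possible random order
```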

Data Recording 

Lateral-view X-ray films were recorded with a 16 mm cine camera at a speed
of 64 fps. The X-ray generator delivered 1 msec pulses to a 9 in. image
intensifier tube. Two lead pellets (^.5 mm diameter) were attached to the
surface of the tongue along the midline. The pellets were located on the dorsum
at points approximately 2 and 3 in. from the tip. Cyanoacrylate was used as the
adhesive. A barium sulfate paste was also used as a contrast medium on the
tongue, and tantalum was applied along the midline of the nose and lips to
outline those structures. The acoustic signal was recorded on magnetic tape and
synchronized with the film record by means of camera-generated synchronization
pulses.

Data Analysis

The X-ray films were analyzed frame-by-frame, using a Perceptoscope film
analyzer. The film was projected life size to a writing surface via an overhead
mirror system. Each of the two pellets was tracked in a coordinate system that
used fixed landmarks as reference points. These points, along with an outline
of the hard palate and upper central incisors, were drawn on a master template.
Photocopies of the master were then used as templates for the measurements of
each film frame. Measurements were made from the time of /k/ release to the
time of closure for the final /p/. Rechecks of a number of measurements
revealed a pellet measurement error of no more than 1 mm.

Each of the pellets was tracked in two dimensions: tongue height versus
time and tongue backing versus time. Examples of these graphs for FSC are shown
in Figure 1. The relative tongue backing measurements provided little in the
way of useful data and will not be presented in this form. As can be seen in
Figure 1, the ballistic patterns for both pellets are essentially the same; the
only real difference is a greater amplitude of movement for the anterior pellet
during the production of the open vowel /a/. For that reason alone, this pellet
will be used to illustrate the data in the Results section.

Jaw movement in the vertical plane was also tracked frame-by-frame. This
was done by measuring the vertical distance between the upper and lower central
incisors.

Besides tracking the dynamics of tongue movement, pellet positions can also
be used to construct vowel target positions in the traditional articulatory
sense. Figure 2 shows a typical configuration for the vowels /i,a,u/; the
pellet positions appear as the three points in the two-dimensional articulatory
triangle.




Figure 2: Typical pellet positions at the target positions for /i/, /a/, and
/u/.

Wide-band spectrograms of all the utterances produced during the X-ray run
were made from the accompanying magnetic tape recording. Duration measurements
were also made from the spectrograms. Durations from the time of /k/ release to
the time of /p/ closure averaged 560 msec and 390 msec for FSC and 510 msec and
370 msec for TG, for the normal and fast speaking rates, respectively.
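
These durations can be related to the 64 fps film speed to see roughly how many
frames were available per utterance, and how much the fast rate compressed the
utterance. The Python sketch below is a convenience calculation only; the
function and variable names are hypothetical, and the frame counts are
approximations that ignore where frame boundaries fall relative to the /k/
release and /p/ closure.

```python
FPS = 64  # camera speed reported in the Data Recording section

# Mean /k/-release-to-/p/-closure durations in msec, as given in the text.
DURATIONS_MSEC = {
    "FSC": {"slow": 560, "fast": 390},
    "TG": {"slow": 510, "fast": 370},
}


def frames_spanned(duration_msec, fps=FPS):
    """Approximate number of film frames covering a stretch of speech."""
    return duration_msec / 1000 * fps


def fast_to_slow_ratio(subject):
    """Fast-rate duration as a proportion of the slow-rate duration."""
    d = DURATIONS_MSEC[subject]
    return d["fast"] / d["slow"]


for subject, rates in DURATIONS_MSEC.items():
    for rate, msec in rates.items():
        print(subject, rate, round(frames_spanned(msec), 1), "frames")
    print(subject, "fast/slow duration ratio:", round(fast_to_slow_ratio(subject), 2))
```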

RESULTS 

This section is divided into three parts: the effect of phonetic context 
on vowel target position, the effect of speaking rate on vowel target position, 
and the acoustical consequences of these articulatory effects. 

Phonetic Context Effects 

In this section the effect of the intervocalic consonant on the target 
positions of the first and second vowels, the effect of the second vowel on the 
target position of the first vowel, and the effect of the first vowel on the 
target position of the second vowel will be described. 

Figures 3 and 4 summarize the effect of the intervocalic consonant on the
target positions of the first and second vowels. These figures show the
relative positions in two dimensions of the anterior pellet at the vowel target
(point of farthest articulator displacement). For both subjects, the target
positions for /i/ and /u/ in both pre- and postconsonantal positions are quite
stable across changes in the consonant. Generally speaking, target variability
for /i/ and /u/ rarely exceeded 2 mm, and never exceeded 3 mm. For /a/, however,
individual differences appear. While the positions for TG remain stable, the
targets for FSC show a rather strong consonant effect, primarily in the height
dimension. These differences occur for both the first and second vowels and
span a distance of almost 8 mm. Displacement for both the first and second
vowels is least when the consonant is /t/ and greatest when the consonant is
/p/.

The reason for these differences becomes apparent in the movement tracking
measurements. Figure 5 shows the measurements for tongue height (anterior
pellet) and jaw opening for the entire VCV utterance, for both subjects. While
the data for TG show essentially identical movement patterns and target
positions throughout the utterance, the data for FSC show variability for all
three phonetic segments. This variability appears in both the displacement and
velocity components of the curves. The tongue not only extends farther for the
vowel, but also moves more quickly (steeper slope) toward its target when the
consonant is /p/.

Apparently, displacement for the vowel is greatest when the consonant is
/p/ because the tongue and jaw are least involved in the production of this
consonant. Both /t/ and /k/, on the other hand, are characterized by greater
degrees of jaw closure; this probably acts to constrain the degree of opening
for the adjacent vowels. Although the displacement differences for the tongue
cannot be accounted for entirely by differences in jaw opening, the correlation
between the two measures is obviously quite high. This is shown by the fact that
the curves for the tongue body closely shadow those for the jaw.

In addition to affecting the displacement of its neighboring vowels, the
consonant also conditions the timing of the movement toward the vowel. Figure 6
shows the tongue height measurements for the utterances /ipa/, /ita/, and /ika/.
For both subjects, movement toward the second vowel occurs earliest when the
consonant is /p/. This effect occurs whenever the tongue moves from /i/ or /u/
to /a/.¹ Again, these differences are apparently related to the independence of
the tongue during the articulation of /p/.

Individual differences in displacement also appear for /a/. While tongue
displacement for the vowel remains stable in all consonant contexts for TG,
consonant effects are evident for FSC (greater displacement for the second vowel
when preceded by /p/, /k/, /t/, in that order). Interestingly, these differences
appear to be due solely to differences in the timing of the movement from the
consonant; in contrast, the effects described earlier were apparently caused by
differences in the degree of displacement for the consonant.

Figure 7 summarizes the effect of the second vowel on the target positions
of the first vowel. This figure shows the target positions of the first vowel
as a function of different second vowels in the /p/ consonant context. For both
subjects, the target positions of all three first vowels are stable across
changes in the second vowel (again, generally within a range of 2 mm). This
stability is also evident when the consonant is /t/ and /k/. Apparently,
right-to-left effects do not extend across the consonant to the preceding vowel.

Although the first vowel in the VCV utterance is not sensitive to any 
right-to-left effects beyond the consonant, the second vowel is subject to some 
left-to-right, or carryover, vowel effects; these effects, however, are fairly 
complicated and linked to the consonant. 

When the consonant is /p/ the first vowel has no real effect on the target
position of the second vowel (Figure 8). All three vowels maintain positional
stability. However, when the intervocalic consonant is either /t/ or /k/,
left-to-right effects appear. Although the targets for both /i/ and /u/ (in the
second vowel position) remain stable, the first vowel exerts a strong effect on
the target position of /a/, this time for both subjects. These effects are
illustrated in Figure 9 (two-dimension measurements) and Figure 10 (tongue
height versus time measurements) for the /t/ consonant environment. These
figures show less opening for /a/ when the first vowel is /a/ than when the
first vowel is either /i/ or /u/.

At first glance these effects are quite surprising. It would seem
intuitively more likely that greater degrees of opening for the second vowel
would be caused by a more open first vowel. However, closer inspection of
Figure 10 can explain these effects. At the time of closure for the consonant
(0 on the abscissa), both the tongue body and jaw are in approximately the same
position for each of the three first vowels. Up until this point, however, the
tongue is closing toward this position from /a/, whereas it is opening toward
this position from both /i/ and /u/. Thus, the tongue is moving in different
directions at this point, and, in effect, has a head start towards the second
vowel when the first vowel is close.



¹This effect was also observed in the front-back dimension when the tongue moved
from /i/ to /u/, and vice versa.

To determine the extent to which jaw opening controls tongue height for an
open vowel, the tongue measurements were plotted against the jaw measurements
(tongue - jaw) to obtain the net movement curves for the tongue, i.e.,
independent from the jaw. These data are shown for both subjects in Figure 11.
Since the three vowels in this figure are characterized by almost equal degrees
of displacement at and near the target, it is apparent that the differences in
opening for the vowel are controlled by the jaw (Lindblom and Sundberg, 1971).
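
The subtraction described here can be sketched in a few lines of Python. This
is an illustration under stated assumptions, not the procedure actually used:
the function name is hypothetical, both tracks are assumed to be expressed as
millimetres of opening per film frame, and the sample values are invented to
show the pattern, not taken from Figure 11.

```python
def net_tongue_movement(tongue_mm, jaw_mm):
    """Frame-by-frame tongue displacement with the jaw contribution removed."""
    if len(tongue_mm) != len(jaw_mm):
        raise ValueError("tongue and jaw tracks must have the same number of frames")
    return [t - j for t, j in zip(tongue_mm, jaw_mm)]


# Invented tracks (mm of opening, one value per frame) for a second vowel /a/
# following two different first vowels.
jaw_after_a = [2.0, 4.0, 6.0, 7.0]
tongue_after_a = [5.0, 9.0, 13.0, 15.0]
jaw_after_i = [2.0, 5.0, 8.0, 10.0]
tongue_after_i = [5.0, 10.0, 15.0, 18.0]

net_a = net_tongue_movement(tongue_after_a, jaw_after_a)
net_i = net_tongue_movement(tongue_after_i, jaw_after_i)
print(net_a)
print(net_i)
```

In these invented data the raw tongue tracks differ between contexts but the
net curves coincide, which is the pattern that would attribute the difference
in vowel opening to the jaw.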

The jaw opening data are interesting from another point of view: /u/ is
the only vowel characterized by a closed jaw position. Both /a/ and /i/ are
produced with a more open jaw: /a/ for obvious reasons, and /i/ probably to make
room for the bunching of the tongue. Although the degree of jaw opening for /i/
shown in this figure approaches that for /a/, in most cases jaw opening for /i/
is somewhat less than this, usually one-half to, at most, two-thirds that for
/a/.

The results of this section can be summarized as follows. The vowel
targets of both /i/ and /u/ are highly stable across changes in either the
consonant or the vowel. The targets for /a/ are more variable, especially for
one subject. Target position variability, when it does appear, is conditioned by
both the consonant (left-to-right and right-to-left effects) and the first vowel
(left-to-right effects). Right-to-left effects of the second vowel on the first
vowel were virtually nonexistent.

Speaking Rate Effects 

An increase in speaking rate generally resulted in a decrease in
articulatory displacement for the vowel. Although target undershoot was usually
present in the speech of both speakers, in a number of instances an increase in
speaking rate had no appreciable effect on the displacement of the vowel. These
occurrences were more frequent for TG. Undershoot occurred both more often and
to a greater degree for /a/, and averaged 3-5 mm for FSC and 1-3 mm for TG.

The context effects that appeared at the slow speaking rate were generally
absent at the fast speaking rate. This is probably because jaw movement was
more restricted during fast speech; thus, the jaw-dependent contextual effects
tended to disappear.

Perhaps the most interesting fast-speech effect occurred for the /apa/
sequence (illustrated in Figure 12). For both subjects, tongue movement from the
initial /k/ to the second vowel occurs as a single articulatory gesture through
both the first vowel and the consonant; there is no articulatory target evident
for the first vowel. In this instance also, the tongue moves somewhat
independently from the jaw, with the jaw either holding steady (FSC) or closing
(TG) for the intervocalic consonant while the tongue continues moving downward
for the second vowel.

Acoustical Consequences 

Wide-band spectrograms were made of all the utterances spoken by both
speakers. However, because of the noise produced by both the X-ray generator
and cine camera, a number of tokens (/u/, in particular) could not be analyzed.
The noise also obscured the first formant frequencies of both /i/ and /u/ to a
point where most of these measures must be considered dubious. Nonetheless, the
acoustical measures still provided adequate information to shed some light on
the two most important questions at issue: (1) whether the context-dependent
articulatory variability resulted in corresponding acoustic variability, and
(2) whether the undershoot effects evident during fast speech reduced the
acoustic vowel triangle towards the neutral schwa.

Although the spectrograms showed the presence of considerable acoustic
variability, this variability could not be attributed to any of the
coarticulation effects described earlier; acoustic variability occurred almost
at random. In fact, the only consistent acoustic effect occurred for TG, where
the first and second formant frequencies of /a/ showed a consonant effect. First
and second formant frequencies for /a/ averaged 775 Hz and 1300 Hz when the
intervocalic consonant was /p/ and 825 Hz and 1425 Hz when the intervocalic
consonant was either /t/ or /k/. We should remember, however, that the
coarticulation effects of the consonant occurred only for FSC and not TG. The
articulatory targets of TG were stable for these utterances. The ranges for
first and second formant frequencies across all consonants are shown in Table 1.



TABLE 1: Ranges of formant frequencies (Hz) for all occurrences of /i/, /a/,
         and /u/ during the slow speaking rate condition. Values are rounded
         to the nearest 25 Hz.

               Subject FSC                  Subject TG

           F1           F2              F1           F2

/i/     350 - 450   1850 - 2125      475 - 575   2150 - 2375
/a/     700 - 850   1150 - 1375      750 - 850   1275 - 1475
/u/     450 - 525    875 - 1025      575 - 625    950 - 1075


The effect of an increase in speaking rate on the formant frequencies of
all three vowels is shown, for both subjects, in Figure 13. These graphs show
the F1-F2 coordinate positions for all occurrences (where measurable) of /i,a,u/
at both the slow (s) and fast (f) speaking rates.

For both subjects, an increase in speaking rate is accompanied by an
increase in the frequency levels of both the first and second formants. The
increases are generally greater for FSC than for TG. The formant frequency
measurements for both subjects show the same range of variation during fast
speech as during slow speech. Some overlap of coordinate positions is also
evident for the two speaking rates. The increases in formant frequencies for
/i/ and /u/ might be explained by the more open vocal tract observed for these
vowels during fast speech (articulatory undershoot for /i/ and /u/ results in
a greater degree of openness). This explanation, however, could not apply to
the formant frequency shift observed for /a/. The most important aspect of
these measurements, however, is that the acoustic triangle is not reduced
towards the neutral schwa during fast speech, i.e., articulatory undershoot
during fast speech does not produce the same acoustic result as articulatory
undershoot during destressed speech (Lindblom, 1963).¹ These different
acoustic effects are probably related to differences in the magnitude of
articulatory undershoot characterizing the vowel during fast and destressed
speech.

¹The notion of vowel neutralization during fast speech does not hold up even
if the somewhat dubious measures of F1 for /i/ and /u/ are discounted; the
upward shift of F2 for /i/ precludes this possibility.



DISCUSSION 



The major findings of this experiment can be summarized as follows. During
slow speech, the target positions of both /i/ and /u/ remain relatively stable
across changes in both the preceding and following consonant and vowel. The
production of /a/, although not subject to right-to-left effects beyond the
following consonant, is sensitive to changes in the consonant, as well as in
the vowel preceding the consonant. These coarticulation effects, however, are
not reflected, as such, in the acoustical measurements. The production of /i/,
/a/, and /u/ during fast speech is characterized by articulatory undershoot
and an upward shift in the frequencies of the first and second formants.

These results tend to be somewhat perplexing; there is, on the one hand, a
strong tendency for target position stability for /i/ and /u/, and on the
other hand, almost as strong a tendency for variability for /a/. The extent of
these differences in variability is considerable: maximally, 3 mm for /i/ and
/u/, and 8 mm for /a/. Yet acoustic variability for /a/ is no greater than
that for /i/ and /u/. This seems to indicate that the acoustical properties of
/i/ and /u/ are more sensitive to articulatory variability than those for /a/,
at least for the parameters measured in this study. However, Stevens' (1972)
view that opening for an open vowel, for example, can be perturbed
considerably without any change occurring in the acoustic output is
accommodated by these data only if the acoustic variability observed were the
result of an articulatory perturbation not measured in this experiment
(pharyngeal cavity size, in particular).

Perhaps it is the relative acoustic insensitivity to articulatory
variability that allows the speech production mechanism a certain degree of
latitude in the production of /a/; on the other hand, the effects might be
primarily inertial. While /i/ and /u/ targets are attained primarily by
movements of the tongue, opening for /a/ is controlled by the jaw. It is
conceivable that because of its greater mass and the nature of its suspension
system, the jaw cannot be moved about with the same degree of accuracy as the
tongue.

An increase in speaking rate is also accompanied by articulatory
variability; undershoot for the vowel target is more the rule than the
exception. However, reduction towards schwa is not evident in the acoustic
measures. This, of course, is not necessarily an unexpected result. If vowels
produced during faster speech were neutralized, fast speech would be
characterized by unintelligible strings of consonants and schwas. Even though
undershoot for the vowel is evident for both fast and destressed speech, it is
obvious that the two features are controlled by two different strategies. The
differences in the acoustic effect are probably due to the magnitude of
articulatory differences.
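One way to make the question of reduction towards schwa concrete is to compare
the area of the F1-F2 vowel triangle across conditions. The sketch below is
only an illustration under stated assumptions: it uses the midpoints of the
slow-rate ranges for subject FSC from Table 1, treats the centroid of the
triangle as a stand-in for the neutral vowel, and the "reduced" and "shifted"
triangles are hypothetical constructions, not measurements from this study.

```python
def triangle_area(p, q, r):
    """Area of a triangle with (F1, F2) vertices, by the shoelace formula."""
    return abs(
        p[0] * (q[1] - r[1]) + q[0] * (r[1] - p[1]) + r[0] * (p[1] - q[1])
    ) / 2.0


def shrink_toward_centroid(points, factor):
    """Move every vertex toward the centroid; factor=1.0 leaves it unchanged."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(cx + factor * (x - cx), cy + factor * (y - cy)) for x, y in points]


def midpoint(lo, hi):
    return (lo + hi) / 2.0


# Midpoints of the slow-rate ranges in Table 1, subject FSC (Hz).
slow_fsc = [
    (midpoint(350, 450), midpoint(1850, 2125)),  # /i/
    (midpoint(700, 850), midpoint(1150, 1375)),  # /a/
    (midpoint(450, 525), midpoint(875, 1025)),   # /u/
]
slow_area = triangle_area(*slow_fsc)

# Hypothetical neutralization: every vowel moves halfway to the centroid.
reduced_area = triangle_area(*shrink_toward_centroid(slow_fsc, 0.5))

# Hypothetical uniform upward shift of both formants (values are made up).
shifted = [(f1 + 50.0, f2 + 100.0) for f1, f2 in slow_fsc]
shifted_area = triangle_area(*shifted)

print(f"slow triangle area:    {slow_area:.1f} Hz^2")
print(f"neutralized (assumed): {reduced_area:.1f} Hz^2")
print(f"shifted (assumed):     {shifted_area:.1f} Hz^2")
```

Under these assumptions, neutralization shrinks the area while a uniform
upward shift of the whole triangle leaves it unchanged, which is the
distinction the acoustic data are being asked to support.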

Unlike stress effects, speaking rate effects cannot be attributed solely to
articulatory sluggishness. The data of both Gay et al. (1974) and Gay and
Ushijima (in press) show (for the same two subjects used in this experiment)
that vowels produced during fast speech are characterized by a decrease in the
activity level of the muscle; in other words, undershoot is programmed into
the gesture. This means, in effect, that a gesture towards a vowel is not
directed toward one specific target position. The gesture can be modified to
some degree without the loss of whatever perceptually significant acoustic
feature limits the vowel. Although the acoustic data during fast speech show
an upward shift in both the first and second formants for both subjects, it is
not known whether these shifts occur simply within the field of each
individual vowel or represent a generalized upward shift of the entire
triangle.

Because variability is built into the production of a phone at a level
higher than the peripheral speech mechanism, a vowel target cannot be
internalized, much less operationally defined, as an invariant event.
Nonetheless, MacNeilage's (1970) three-dimensional coordinate system still
seems to be the best basis for describing a vowel. However, such a
specification would have to be expanded to include a spatial field, the
boundaries of which are defined by the acoustic limits of the vowel.

Although the data of this study easily fit into a field-specified vowel
system, the entire schema is by no means complete. First, this experiment
studied only the tongue-jaw system; the entire pharyngeal cavity remains
unspecified. Second, the point vowels, although delimiting, and perhaps
normalizing, the vowel space for a given individual, cannot serve to specify
all the vowels of a language (Lieberman, in press). Indeed, before a general
model of vowel production can be posited, the intermediate vowels must
likewise be specified.

In summary, this experiment produced two major findings. First,
articulatory variability in terms of vowel target position exists, but not to
a degree that correlates with the existing acoustic variability. Second,
articulatory variability also occurs with an increase in speaking rate;
speaking rate effects, however, unlike stress effects, are accompanied by an
upward shift in formant frequencies.
Mechanisms of Duration Change*



S. Harris+
Haskins Laboratories, New Haven, Conn.



ABSTRACT 

Changes in stress, speaking rate, and terminal consonant are
known to modify the duration of vowels in spoken utterances (Lehiste,
1970). Electromyographic investigation permits a detailed examination
of the mechanisms underlying these observed effects; in particular,
it is possible to test Lindblom's (1963) hypothesis that
context-dependent vowel color alterations result from a change in the
timing of the signals to the articulators, rather than from a
reorganization of the articulatory process. Results suggest that the
articulatory process itself is reorganized, and that reorganization is
different for the three types of change.

INTRODUCTION 

In 1963, Lindblom wrote a classic paper on the effects of variations in
word stress on the target formant position of vowels. The object of his
experiments was to show that the so-called "vowel neutralization phenomenon"
could be derived from a very simple model of upper vocal tract control. The
phenomenon itself is well known; briefly, as a syllable is destressed, its
vowels will tend to be more neutralized, as well as shorter in duration.
Lindblom suggested that the neutralization is a consequence of the shortening.
He made a number of spectrographic measurements, showing a regular
relationship between duration of vowels and their target formant positions.
The relationship is consonant with a model in which the signals sent to the
articulators are determined by a stored template for each vowel, independent
of its stress position; if signals are sent to the articulators at rates
greater than some critical value, target position is not attained before new
signals arrive; thus, the shorter the vowel, the greater the target
undershoot. In his original paper, Lindblom (1963) suggests that the same
model can be applied to the effects of changes in speaking rate and changes in
stress on vowel color. The model presumably could be extended to apply to any
context effect that might be expected to cause changes in vowel duration, such
as the well-known effect of the voicing status of the final consonant on the
vowel. Lindblom did not suggest the level at which constant signals are
presumed to be sent to the articulators. The simplest suggestion is that the
signals to the muscles might be constant. If this were true, we might expect
electromyographic (EMG) signals to the muscles to be of equal size, under
conditions of varying stress, speaking rate, and voicing status of the final
consonant. A secondary result of Lindblom's model is that the timing
relationship between consonant and vowel signals may be expected to change, as
the duration of the vowel changes.

*Paper presented at the Speech Communication Seminar, Stockholm, Sweden, 1-3
August 1974, and to be published in the conference proceedings (Stockholm:

The experiment described here was the latest in a series of tests of
Lindblom's hypothesis; the results will be compared with a number of related
experiments.
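The duration-dependent undershoot idea summarized in the Introduction can be
illustrated with a toy calculation. The sketch below is not Lindblom's
published equation: it simply assumes that a formant approaches its target
exponentially from a consonantal starting value at a fixed rate, so that a
constant "signal" yields less complete target attainment when the vowel is
shorter. The starting value, target, and rate constant are all hypothetical.

```python
import math


def formant_reached(start_hz, target_hz, duration_ms, rate_per_ms=0.02):
    """Formant value after duration_ms of exponential approach to the target.

    Assumption: first-order (exponential) approach with a fixed rate constant.
    """
    return target_hz + (start_hz - target_hz) * math.exp(-rate_per_ms * duration_ms)


def undershoot_hz(start_hz, target_hz, duration_ms, rate_per_ms=0.02):
    """Distance (Hz) by which the formant falls short of its target."""
    reached = formant_reached(start_hz, target_hz, duration_ms, rate_per_ms)
    return abs(target_hz - reached)


# Hypothetical F2 start and target values; durations are merely plausible
# vowel lengths in msec, chosen for illustration.
for duration in (117, 167, 201):
    miss = undershoot_hz(start_hz=1500.0, target_hz=2000.0, duration_ms=duration)
    print(f"{duration} msec vowel: undershoot of about {miss:.1f} Hz")
```

In this toy model the only thing that varies is time; the experiment reported
here asks whether the size of the signal itself also varies.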

EXPERIMENT 

Procedure 

A single speaker recorded the four-syllable nonsense utterances /apipipa/
and /apipiba/ under several conditions: stress was placed on either the first
or the second /i/; there were two speaking rates; and thus, there was a total
of eight utterance types. Utterances were arranged in random lists of 28, with
slow and fast lists alternated. After the removal of faulty utterances, there
were between 24 and 33 utterances of each type for averaging.
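The count of eight utterance types follows from crossing the three two-level
factors. The short sketch below just enumerates that design; the string
labels are illustrative names chosen here, not notation from the experiment.

```python
from itertools import product

UTTERANCES = ("apipipa", "apipiba")       # terminal /p/ versus terminal /b/
STRESS = ("first /i/", "second /i/")      # which /i/ carries the stress
RATES = ("slow", "fast")                  # the two speaking rates

# Full crossing of utterance x stress placement x speaking rate.
utterance_types = [
    {"utterance": u, "stress_on": s, "rate": r}
    for u, s, r in product(UTTERANCES, STRESS, RATES)
]

for t in utterance_types:
    print(f'/{t["utterance"]}/ | stress on {t["stress_on"]} | {t["rate"]}')
```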

Hooked-wire electrodes were inserted bilaterally into the anterior portion
of the genioglossus muscle (GGA). A single electrode was inserted into the
posterior portion of the genioglossus muscle (GGP). Results from other
placements for this experiment will not be discussed here. Electrode
construction and placement are discussed in Hirose (1971).

EMG data were amplified and recorded on 16-channel instrumentation tape,
together with the acoustic signal and code pulses for later computer analysis.
After inspection for artifacts, EMG signals were processed and averaged by
techniques previously described by Port (1971) and Kewley-Port (1973).
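For readers unfamiliar with token averaging, the general idea can be sketched
as follows. This is a generic illustration, not the procedure of Port (1971)
or Kewley-Port (1973): it assumes each token is a list of samples with a known
line-up index, rectifies the samples, and averages the tokens sample by sample
over a fixed window around the line-up point. The function and parameter names
are hypothetical.

```python
def rectify(samples):
    """Full-wave rectification: absolute value of every sample."""
    return [abs(s) for s in samples]


def average_aligned(tokens, lineup_indices, before, after):
    """Average rectified tokens over the window [lineup - before, lineup + after).

    Tokens that do not cover the whole window are skipped.
    """
    window = before + after
    sums = [0.0] * window
    used = 0
    for samples, lineup in zip(tokens, lineup_indices):
        start, stop = lineup - before, lineup + after
        if start < 0 or stop > len(samples):
            continue  # token too short for the requested window
        for i, value in enumerate(rectify(samples[start:stop])):
            sums[i] += value
        used += 1
    if used == 0:
        raise ValueError("no token covers the requested window")
    return [total / used for total in sums]


# Two made-up tokens whose line-up points fall at different sample indices.
tokens = [
    [0, -1, 2, -3, 4, 0],
    [0, 0, 1, -2, 3, -4, 0],
]
lineups = [2, 3]
averaged = average_aligned(tokens, lineups, before=1, after=3)
print(averaged)
```

Aligning on a common event before averaging is what allows peak heights and
durations of the averaged curves to be compared across utterance types.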

Wide-band spectrograms were made of 56 utterances, half from near the
beginning and half from near the end of the recording session. Thus, acoustic
records were available of seven utterances of each type. On the spectrograms,
measurements were made of the duration of the two syllables containing /i/,
from the release of closure to closure or the end of voicing. The second and
third formants were measured in each syllable at their peak frequency.

Results 

Averaged EMG curves for one of the GGA leads are shown in Figure 1. The
point on the time line marked "zero" is the averaging lineup point for each
utterance, and is at the /p/ closure of the second syllable. The duration of
the acoustic events is indicated above each figure. Clearly the EMG signal
associated with each stressed syllable has a somewhat larger peak height, and a
somewhat longer duration, than its unstressed counterpart. This is in accord
with earlier results (Harris, 1973). Further, there is a systematic tendency
for the fast speaking condition to show smaller peak heights than the slow
condition (Gay, Ushijima, Hirose, and Cooper, 1974). Although it is less
obvious, there is no overall systematic trend for vowel peaks associated with
terminal /p/ and /b/.

Correlation coefficients were calculated between various measures. Results
are shown in Table 1. Correlations are quite high, and are uniformly
significant at the .05 level or greater.



TABLE 1

              F2      F3      GGA_R   GGA_L   GGP

duration     .89     .81     .57     .60     .56
F2                           .48     .57     .54
F3                           .55     .66     .69
GGA_R                                .93     .90
GGA_L                                        .97



The correlations between the formant levels and duration are essentially a
reiteration of Lindblom's results, in a somewhat different form: that is,
formants reach a more extreme value as duration lengthens. The correlations
between peak height and duration, and peak height and formant value, however,
are a contradiction of Lindblom's hypothesis.
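As a reminder of what such a coefficient measures, the sketch below computes a
Pearson correlation between duration and F2. It is only an illustration: the
inputs are the eight second-syllable cell means reported in Table 2, not the
token-level measurements behind Table 1, so the result is not expected to
reproduce the .89 given there.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Second-syllable cell means from Table 2, in table order
# (terminal /p/: slow, slow, fast, fast; then terminal /b/ likewise).
duration_ms = [167, 145, 131, 117, 201, 179, 147, 133]
f2_hz = [1945, 1907, 1857, 1807, 1936, 1949, 1879, 1816]

r = pearson_r(duration_ms, f2_hz)
print(f"r(duration, F2) over the eight cell means = {r:.2f}")
```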

While the overall correlations are rather high, they are somewhat
misleading, since the effects of the terminal /b/ on half the utterances are
masked. More detailed results, for the second syllable only, are presented in
Table 2.



TABLE 2

                            p                               b
                     Slow          Fast              Slow          Fast
                  [two stress conditions per rate; column labels illegible]

duration in msec   167   145     131   117         201   179     147   133
F2 in Hz          1945  1907    1857  1807        1936  1949    1879  1816
F3 in Hz          2320  2232    2303  2191        2345  2300    2250  2177
GGA_R in µV        876   616     471   460         745   553     460   454
GGA_L in µV        311   263     227   189         283   247     229   191
GGP in µV          343   271     242   186         299   246     215   172
An inspection of the table shows the following results:

1) The expected effects of stress, speaking rate, and voicing on the
duration of the second syllable are obtained.

2) Values of F2 and F3 are more extreme for slow speech and for
stressed production. However, there is no systematic tendency for
the values to differ for terminal /p/ and /b/. The result is not
surprising in view of the classic literature. So far as I know,
it has never been suggested that vowels are more neutral before
voiceless consonants.

3) Stress and speaking rate affect the peak values of muscular
activity. In ten of twelve possible comparisons, peaks are higher for
terminal /p/ than for terminal /b/. The result suggests a systematic
trend. However, Raphael (1974) has examined the peak
heights associated with vowel production in a large number of
utterances in which the terminal consonant is a voiced or voiceless
stop or fricative. His results show a prolongation of the
vowel signal before voiced consonants, but no systematic differences
in amplitude. Obviously, this result requires more systematic
examination.
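The count in item 3 can be checked directly against the peak values in
Table 2. The sketch below pairs each terminal-/p/ cell with the corresponding
terminal-/b/ cell for the three EMG leads; the dictionary layout and lead
labels are conveniences chosen here, and the numbers are copied from the
table.

```python
# Peak EMG values in microvolts from Table 2, in table order within each
# terminal consonant (slow, slow, fast, fast).
peaks_uv = {
    "GGA_R": {"p": [876, 616, 471, 460], "b": [745, 553, 460, 454]},
    "GGA_L": {"p": [311, 263, 227, 189], "b": [283, 247, 229, 191]},
    "GGP":   {"p": [343, 271, 242, 186], "b": [299, 246, 215, 172]},
}

# Compare matching cells: same lead, same rate, same stress condition.
comparisons = [
    (lead, p, b)
    for lead, cells in peaks_uv.items()
    for p, b in zip(cells["p"], cells["b"])
]
higher_for_p = sum(1 for _, p, b in comparisons if p > b)

print(f"{higher_for_p} of {len(comparisons)} comparisons show a higher peak for /p/")
```

The two exceptions both fall in the fast-rate cells of one GGA lead, where the
/p/ and /b/ values differ by only 2 µV.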

DISCUSSION 

It is interesting to consider Lindblom's explanation of the stress effect
in light of the picture it gives of the organization of running speech. In his
model, pushed to an extreme, a series of signals are sent to the articulators,
which depend for their identity on the phonetic specification of the segments.
Changes of stress or speaking rate will affect the timing of the arrival of
these signals (as will certain segmental characteristics of the sequence, by
extension) but not their relative size. The resulting acoustic output will
vary, not because of variation in the signal size, but because of changes in
the relative timing of, for example, successive vowel and consonant signals.
Furthermore, differences between contexts vary along the single dimension of
time.

The results described above suggest that the real picture is substantially
more complex. Signal size for vowels varies systematically with duration for
changes in stress and speaking rate, and does not (apparently) vary with the
duration changes conditioned by the voicing status of the terminal consonant.
Let us examine the stress and speaking rate variations first, since Lindblom's
model is intended to apply only to them. Is the target position observed due
entirely to the size difference observed, or may the result be due in part to
Lindblom's proposed mechanism? In Lindblom's model, any duration change
automatically generates a change in target position, unless it is counteracted
by some other adjustment. Therefore, if vowels are longer preceding /b/, then
signals should be smaller, if the acoustic target for the formants is to be
the same. As noted above, in the present experiment, there is a trend in this
direction, although the same trend is not seen in other experiments. This
point must be examined further.

In the present experiment, the effects of stress and speaking rate are at
least qualitatively homogeneous. However, there is some evidence that consonant
and vowel signals do not behave in parallel ways under the two manipulations.
Some years ago, we found that consonants associated with heavily stressed
syllables will be produced with stronger associated articulation (Harris, Gay,
Sholes, and Lieberman, 1968). Gay et al. (1974) have found that faster speaking
rates are associated with higher consonant peak heights. These effects are in
opposite directions to the associated vowel articulations. Our information is,
however, concerned with the somewhat special circumstance of the labial
consonant surrounding the vowel, and should be examined in more varied
environments.

Lindblom's model is in one respect similar to a much earlier formulation
proposed by Cooper, Liberman, Harris, and Grubb (1958). In both models,
constant signals yield a variable output, due to variability in timing. In all
the experiments we have performed in manipulating stress, speaking rate, and
context, we seem to get the opposite result. The relative timing of consonant
and vowel gestures to different articulators seems to be very closely time
locked, while the amplitude and duration of gestures for particular segments
vary substantially. We will be interested to see what happens in those
conditions where the same articulator is involved in both consonant and vowel
gestures.
The Physiological Control of Durational Differences between Vowels Preceding
Voiced and Voiceless Consonants in English*

Lawrence J. Raphael+

Haskins Laboratories, New Haven, Conn.



ABSTRACT 

A series of two electromyographic experiments was designed to
determine the nature of the muscular activity underlying the
articulation of consonant-vowel-consonant (CVC) syllables in which
identical vowels differed in duration because of the voicing characteristic
of the consonant that followed them. Results indicate that the most
reasonable hypothesis to explain the durational differences posits a
sustention of muscular activity in the articulatory gesture of the
vowel preceding voiced consonants, relative to the gesture for vowels
preceding voiceless consonants. It is noted that the acoustically
determined differences between vowels, the differences between the
durations of the muscular-articulatory gestures for the vowels, and
the temporal displacement of the final consonant peaks generally show
remarkably similar values.

The difference between the durations of vowels preceding voiced and
voiceless consonants in English is well documented in the phonetic literature
(Locke and Heffner, 1940; Kenyon, 1951; Peterson and Lehiste, 1960; House,
1961; Gimson, 1962). Investigators have described and/or commented on the
perceptual consequences of these differences (Denes, 1955; Noll, 1960;
Jakobson and Halle, 1967; Raphael, 1972), and have theorized as to whether the
variation in vowel duration is a physiologically mandated behavior, one that
is learned, or, to some extent, both (Zimmerman and Sapon, 1958; Peterson and
Lehiste, 1960; House, 1961; Delattre, 1962; Elert, 1964; Chen, 1970).

Little, however, has been discovered or written about the physiological
activity that must underlie durational differences, no matter what their cause.
The present study was undertaken to specify the muscular activity governing the
articulatory gestures for vowels preceding both voiced and voiceless consonants.
The studies referred to above, whether based on impressionistic evidence or on
the analysis of acoustic records, suggest three hypotheses for the mechanism
that renders vowels longer in duration before voiced consonants than before
voiceless consonants. The first, and perhaps the simplest, posits a greater
duration of muscular activity for a vowel preceding a voiced consonant than for
one preceding a voiceless consonant. Under this hypothesis final consonant
articulations of either voicing type would be more or less identical, with the
same time of onset relative to the offset of the preceding vowel.

*To be published in Journal of Phonetics (1975) 2. -5-33.
+Also Herbert H. Lehman College of the City University of New York.

The second hypothesis posits muscular activity of the same duration for
vowels in both the voiced and voiceless environments. The durational difference
would then be effected by a difference in the timing of the onset of muscular
activity of the following consonants in relation to the offset of preceding
vowel activity: relatively earlier in the voiceless case and relatively later
in the voiced case. Such differences have been found in lip and jaw movements
for stops [Kozhevnikov and Chistovich, 1965; Ohala, Hiki, Hubler, and Harshman,
1968; Chen, 1970; Kim and MacNeilage, 1972 (cited in MacNeilage, 1972);
Leanderson and Lindblom, 1972], although the magnitude of these differences does
not appear to be great enough to account for the durational difference between
English vowels (MacNeilage, 1972).

The third hypothesis merges the first two and posits differences both in
the duration of muscular activity for vowels and in the relative timing of the
onset of the muscular activity for the following consonants.
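The three hypotheses differ in which of two measurable quantities carries the
acoustic duration difference, and that logic can be written down as a small
decision rule. The sketch below is a hedged illustration only: the function
name, the 25 percent "negligible" threshold, and the example numbers are all
hypothetical and are not taken from the experiments reported here.

```python
def classify_mechanism(vowel_emg_diff_ms, onset_shift_ms, acoustic_diff_ms,
                       negligible=0.25):
    """Say which hypothesis a pair of measurements is most compatible with.

    vowel_emg_diff_ms: vowel muscle-activity duration, voiced minus voiceless.
    onset_shift_ms:    consonant onset time relative to vowel offset,
                       voiced minus voiceless.
    acoustic_diff_ms:  acoustic vowel duration, voiced minus voiceless.
    negligible:        assumed fraction of the acoustic difference below
                       which a contribution is ignored.
    """
    if acoustic_diff_ms <= 0:
        raise ValueError("acoustic difference must be positive")
    vowel_contributes = abs(vowel_emg_diff_ms) / acoustic_diff_ms >= negligible
    timing_contributes = abs(onset_shift_ms) / acoustic_diff_ms >= negligible
    if vowel_contributes and timing_contributes:
        return "hypothesis 3: vowel duration and consonant timing both differ"
    if vowel_contributes:
        return "hypothesis 1: longer vowel muscle activity"
    if timing_contributes:
        return "hypothesis 2: later consonant onset"
    return "undetermined"


# Made-up measurement pairs, in msec.
print(classify_mechanism(200, 10, 210))
print(classify_mechanism(15, 190, 210))
print(classify_mechanism(110, 100, 210))
```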

EXPERIMENT I

Procedure 

We constructed a series of real-word, minimal pair, CVC test utterances in
which the articulation of the initial consonants and vowels would be essentially
controlled by a muscle or set of muscles different from and independent of the
muscles controlling the articulations of the final consonants. For example, in
the minimal pair leaf-leave, it was assumed that the initial consonant and the
vowel would be controlled by lingual muscles, whereas the final consonant would
be under the control of labial muscles. (This proved to be the case for a pair
such as leaf-leave, but for other pairs, namely those containing back,
lip-rounded vowels, the separation of muscle function was not as clear as had
been desired. For pairs such as bought-bawd and moat-mowed there was
genioglossus activity for the vowel as well as for the final consonant. A more
anterior electrode placement might reduce or eliminate this activity. Thus,
some of the information concerning onset of muscle activity for the final
consonant was obscured, although by no means completely.) Of the six minimal
pairs used, three were of a labial-to-lingual configuration (mowed-moat,
bawd-bought, moos-moose), and three of a lingual-to-labial configuration
(leave-leaf, thieve-thief, lab-lap).

Two subjects took part in the experiment. Both read the words in isolation
from a series of ten randomized lists. At least 12 and as many as 19 tokens of
each type were used to produce the averaged electromyographic (EMG) curves.

The muscles explored for labial articulation were the orbicularis oris and
the depressor anguli oris. Lingual articulation was investigated by recording
EMG signals from the genioglossus muscle.

Concentric needle electrodes of standard DISA type were used for insertions
into the muscles. Both the EMG output and a voice trace were recorded on
magnetic tape for subsequent computer processing. The onset of voicing was used
as a reference line-up (zero) point in the data manipulation.

Results and Discussion

The only effect consistently found was that of greater duration of muscular
activity in the articulation of vowels preceding voiced consonants. Figure 1
shows the most common manifestation of this effect:¹ the peaks associated with
the vowel articulation occur almost simultaneously in both the voiced and
voiceless cases; there is a sustention of muscular activity in the voiced case
relative to the voiceless case; the onsets of the muscular activity for the
following consonants occur at approximately the same time relative to the
offset of muscular activity for the preceding vowel; the onset durations and
slopes for the muscular activity associated with the following consonants are
generally equivalent. Certainly there is no durational difference between the
onsets of consonant activity (relative to the preceding vowel) on the order of
the durational differences between vowels as determined from acoustic records.
For the utterances and subject of Figure 1, the average vowel duration for
thief was 150 msec, and that for thieve 360 msec. The durational difference
between the muscular activity underlying the vowel articulations, with
reference to the time each EMG curve reaches its base line, is on the order of
220 msec, quite close to the 210 msec difference between the durations of the
vowels in the acoustic measurement.

This sustention of muscular activity following the peak for the vowel was
found for both subjects in most cases. Figure 2 shows typical examples of the
averaged durational differences between the EMG signals caused by the
sustention. Figure 3 displays the temporal displacement of the terminal
consonant peaks associated with the vowels of Figure 2. Note that these
final-consonant peaks were displaced from each other by time values
approximately equal to both the durational differences between the EMG signals
for the preceding vowels and their acoustically determined durational
differences. The data for the vowel duration differences (both acoustic and
EMG) and the temporal displacement of the EMG peaks of the final consonants
are summarized in Table 1.
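The agreement among the three measures summarized in Table 1 can be
quantified in a simple way, for example as the mean absolute difference
between the acoustic vowel duration difference and each EMG measure. The
sketch below does this with the values printed in that table; it is an
illustration added here, not an analysis from the study, and it reads the one
smudged acoustic entry for moose-moos (first subject) as 180 msec, which is an
assumption.

```python
# Rows: (pair, acoustic vowel difference, EMG vowel difference,
#        EMG consonant peak displacement), all in msec, from Table 1.
# First six rows: first subject listed in the table; last six: subject KSH.
rows = [
    ("leaf-leave", 260, 230, 210),
    ("bought-bawd", 155, 110, 115),
    ("moose-moos", 180, 160, 80),    # acoustic value assumed to be 180
    ("moat-mowed", 190, 185, 135),
    ("thief-thieve", 300, 290, 170),
    ("lap-lab", 175, 180, 165),
    ("leaf-leave", 175, 185, 250),
    ("bought-bawd", 90, 100, 90),
    ("moose-moos", 100, 120, 110),
    ("moat-mowed", 125, 160, 150),
    ("thief-thieve", 210, 220, 200),
    ("lap-lab", 105, 85, 130),
]


def mean_abs_diff(pairs):
    """Mean absolute difference over (a, b) pairs."""
    pairs = list(pairs)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


vowel_vs_acoustic = mean_abs_diff((ac, emg) for _, ac, emg, _ in rows)
peak_vs_acoustic = mean_abs_diff((ac, peak) for _, ac, _, peak in rows)

print(f"acoustic vs. EMG vowel difference:    {vowel_vs_acoustic:.2f} msec")
print(f"acoustic vs. consonant peak displacement: {peak_vs_acoustic:.2f} msec")
```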

TABLE 1: Comparison of vowel duration differences as determined
         spectrographically and by EMG measurements with temporal
         displacement of final consonant EMG peaks.

                     Duration Differences (msec)     Consonant Peak
                                                     Displacement (msec)
                     Vowel          Vowel
                     (Acoustic)     (EMG)            (EMG)

Subject LJR
  leaf-leave           260            230              210
  bought-bawd          155            110              115
  moose-moos           180            160               80
  moat-mowed           190            185              135
  thief-thieve         300            290              170
  lap-lab              175            180              165

Subject KSH
  leaf-leave           175            185              250
  bought-bawd           90            100               90
  moose-moos           100            120              110
  moat-mowed           125            160              150
  thief-thieve         210            220              200
  lap-lab              105             85              130

Figure 2: Paired EMG signals for identical vowels, one preceding a voiced and
the other a voiceless, syllable-final consonant.

Figure 3: Paired EMG signals for voiced/voiceless syllable-final consonants.

One other articulatory strategy can be found underlying the durational
differences between vowels: in two cases there is a delay in the onset and
peaking of muscular activity for the vowel preceding the voiced consonant
(Figures 2, 3, bottom graphs). This, in turn, causes a delay in the peak of
muscular activity for the vowel, so that even though the slopes of the offsets
of EMG activity are virtually identical in both voiced and voiceless cases,
the separation of the final voiced and voiceless consonant peaks in this case
is still the result of the difference in timing and duration of vowel
articulation. It may be possible, since the initial consonants in these
utterances are semivocalic in nature, that part of the durational difference
usually carried by the vowel is absorbed by the preceding consonant. Although
the acoustic records do not reveal any consistent differences between the
durations of these initial consonants, the EMG signals for the initial /l/ and
/m/ in the voiced syllables show a slower onset and later peak than do those
in the voiceless syllables. Thus the data do not provide a consistent
explanation of this effect. Further, there remains the question of why one
subject shows the effect for /l/ and not for /m/, and the other for /m/, but
not for /l/.

¹In order to maximize, visually, the temporal relationships between EMG
traces, the peaks for all vowels and consonants have been equated in height,
regardless of their actual microvolt values. However, these values are
recorded: in Figure 1, the microvolt values on the left-hand ordinate are
those of the vowel (genioglossus) gesture, and the microvolt values on the
right-hand ordinate are those of the final consonant (orbicularis oris)
gesture. In Figures 2-5, the microvolt values for the vowel preceding the
voiceless consonant and for the voiceless consonant itself are shown on the
left-hand ordinate; those for the vowel preceding the voiced consonant and for
the voiced consonant itself are shown on the right-hand ordinate. The actual
values shown are peak values for vowel or consonant.

EXPERIMENT II 

Procedure 

The minimal-pair utterances of the second experiment were disyllables
beginning with schwa, followed by /p/. The interconsonantal vowel was variously
/i, ɪ, e, ɛ, æ, ɑ, ʌ, ɔ, o, ʊ, u/. The final consonant was either /p/ or /b/.
Utterances of these types provided negligible coarticulation effects, for the
muscles investigated, between consonant and vowel, or between the vowel and the
final consonant. One of the two subjects who provided data in this experiment
had also participated in Experiment I. A third subject (who had also provided
data in Experiment I) read an alternate list of utterances ending in /k/ or /g/
from which only final-consonant EMG data were obtained.

Data were obtained from the orbicularis oris muscle for the labial stop
consonants and from the mylohyoid muscle for the alternate utterances ending in
velar stops. Vowel data were obtained from the genioglossus and inferior
longitudinal muscles. Electrode insertions were made as described by Hirose
(1971) for the orbicularis oris, mylohyoid, and genioglossus muscles. The
insertion into the inferior longitudinal muscle was made at the lower tongue
surface near the back of the anterior third, approximately 1 cm from the
lateral margin and roughly parallel to the lower surface of the tongue at a
depth of approximately 3 mm.

The EMG signals were stored on magnetic tape for subsequent data
processing. The onset of voicing of the interconsonantal vowel was used as a
reference line-up (zero) point for the data manipulation and displays.

Results and Discussion 

As in Experiment I, there was a greater duration of muscular activity in
the articulations of vowels preceding voiced consonants than in those preceding
voiceless consonants. Both subjects showed this main effect for all muscles and
for all vowels for which data were obtained in this experiment. Figure 4 (a, b)
displays the genioglossus data for the two subjects. The similarity of the
vowel sustention effects, as shown in the EMG curves, among subject LJR in
Experiment I (Figure 2) and in Experiment II, and subject FBB in Experiment II
is readily apparent. The similarity is further reflected in the EMG curve for
the inferior longitudinal muscle for subject LJR (Figure 4c).

The displacement of the final consonant peaks is illustrated in Figure 5
for one of the subjects of this experiment and for the subject who read the
alternate list of utterances ending in /k/ or /g/. As in the first experiment,
the acoustic durational differences as determined from spectrograms, the EMG
durational differences, and the temporal displacement values of the EMG peaks
for the final consonants show remarkably similar values (Tables 2 and 3).
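
The agreement among the three measures can be checked directly from the
tabulated values. The following is a minimal sketch in Python, using the
Table 2 values for the first subject listed there. The helper name and the
choice of mean absolute difference as a summary statistic are assumptions made
for illustration; they are not part of the original analysis.

```python
# Values (msec) transcribed from Table 2, first subject, one entry per vowel.
acoustic = [135, 60, 130, 55, 85, 30, 85, 115, 155, 45, 110]
emg_duration = [125, 70, 140, 55, 75, 40, 80, 105, 165, 55, 115]
peak_shift = [145, 65, 120, 65, 100, 45, 105, 100, 150, 60, 110]


def mean_abs_difference(xs, ys):
    """Mean absolute difference between two equal-length lists.

    Hypothetical helper, introduced here only for illustration.
    """
    if len(xs) != len(ys):
        raise ValueError("lists must have the same length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Acoustic vowel-duration difference versus the two EMG-based measures.
print(round(mean_abs_difference(acoustic, emg_duration), 1))  # 8.2
print(round(mean_abs_difference(acoustic, peak_shift), 1))    # 10.9
```

On these figures the measures differ by roughly 10 msec on average, against
duration differences that range from 30 to 165 msec.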

CONCLUSION 

The data presented here provide strong confirmation for the first
hypothesis presented above. That is, the acoustically measured durational
differences long observed between vowels preceding voiced and voiceless
consonants are primarily controlled physiologically by motor commands to the
muscles governing the articulators that are active in the formation of vowels.
The timing of these commands is generally such that after the peak of the
articulatory-muscular activity has been reached, the articulators are
maintained (although not statically) in shapes and positions appropriate for
vowels somewhat longer when they precede voiced consonants.

Figure 4: Paired EMG signals for identical vowels, one preceding a voiced and
the other a voiceless, syllable-final consonant.

Figure 5: Paired EMG signals for voiced/voiceless syllable-final consonants.
TABLE 2: Comparison of vowel duration differences as determined
spectrographically and by EMG measurements with temporal displacement of final
consonant EMG peaks (for syllables ending in /p/ versus /b/).

                 Vowel Duration Differences      Consonant Peak
                          (msec)                 Displacement (msec)

                 (Acoustic)       (EMG)          (EMG)

Subject LJR
    /i/             135            125            145
    /ɪ/              60             70             65
    /e/             130            140            120
    /ɛ/              55             55             65
    /æ/              85             75            100
    /ʌ/              30             40             45
    /a/              85             80            105
    /ɔ/             115            105            100
    /o/             155            165            150
    /ʊ/              45             55             60
    /u/             110            115            110

Subject FBB
    /i/             150            140            150
    /ɪ/              85             95             75
    /e/             170            175            150
    /ɛ/              50             50             45
    /æ/             130              -            145
    /ʌ/              65             75             80
    /a/             180              -            195
    /ɔ/             110              -            120
    /o/             205            190            185
    /ʊ/              65             60             70
    /u/             145            150            140
TABLE 3: Comparison of vowel duration differences as determined
spectrographically with temporal displacement of final consonant EMG peaks (for
syllables ending in /p/ versus /b/).

                 Vowel Duration Difference       Consonant Peak Displacement
                 (Acoustic - msec)               (EMG - msec)

Subject KSH
    /i/             135                            130
    /ɪ/              45                             45
    /e/             130                            140
    /ɛ/              60                             70
    /æ/             155                            145
    /ʌ/              45                             50
    /a/             185                            165
    /ɔ/             115                            100
    /o/             135                            125
    /ʊ/              45                             60
    /u/             115                            105
Effect of Speaking Rate on Stop Consonant-Vowel Articulation*

T. Gay+ and T. Ushijima++

Haskins Laboratories, New Haven, Conn.



ABSTRACT 



The purpose of this experiment was to study the effect of speaking
rate on the articulation of the stop consonants /p,t/ in combination
with the vowels /i,a,u/. Two speakers of American English read lists
of nonsense syllables containing /p,t/ in all possible
vowel-consonant-vowel (VCV) combinations with /i,a,u/ at both normal
and fast speaking rates. Electromyographic (EMG) records were obtained
from the orbicularis oris, superior longitudinal, and genioglossus
muscles. The EMG data were analyzed using the Haskins Laboratories'
data system. For both labial consonant and lingual consonant
production, the effect of an increase in speaking rate was an increase
in the activity level of the muscle (orbicularis oris and superior
longitudinal). However, for vowel production, the effect of an increase
in speaking rate was a decrease in the activity level of the
genioglossus muscle. These results are discussed in relation to a
general account of speaking rate control.



INTRODUCTION 



In some recent experiments on the production of labial CV sequences, we
showed that an increase in speaking rate involves more than a simple reordering
of the timing of commands to the muscles (Gay and Hirose, 1973; Gay, Ushijima,
Hirose, and Cooper, 1974). Rather, the production of both the consonant and the
vowel segments of a syllable during fast speech was shown to be characterized by
changes in motor organization as well as changes in motor timing. For labial
consonants, the effect of an increase in speaking rate is an increase in the
activity levels of the muscles that control lip closure; however, for vowels,
the opposite effect occurs, i.e., an increase in speaking rate is accompanied by
a decrease in the activity level of the muscle (genioglossus).



*Paper presented at the Speech Communication Seminar, Stockholm, Sweden,
 1-3 August 1974, and to be published in the conference proceedings (Stockholm:
 Almqvist and Wiksell; and New York: Wiley).

+Also University of Connecticut Health Center, Farmington.

++Visiting from University of Tokyo, Japan.

[HASKINS LABORATORIES: Status Report on Speech Research SR-39/40 (1974)]

The purpose of the experiment reported here was to obtain additional labial
consonant data from both of our previous subjects and to extend our observations
of VCV articulations to lingual CV sequences.

METHOD

Subjects were two adult males, both native speakers of American English.
The speech material consisted of the consonants /p,t/ and the vowels /i,a,u/ in
a trisyllable nonsense word of the form /kV1CV2pə/, where V1 and V2 were all
possible combinations of /i,a,u/, and C was either /p/ or /t/. The utterances
were randomly ordered into a master list. Each utterance (preceded by the
carrier phrase, "It's a....") was read at two speaking rates: normal and fast.
Each rate was based on the subjects' own appraisal of comfortable slow and fast
rates. On the average, the fast speech was two-thirds to three-fourths the
duration of the normal speech.

For both subjects, conventional hooked-wire electrodes were implanted in
the orbicularis oris, superior longitudinal, and genioglossus muscles. The
orbicularis oris muscle is largely responsible for closure of the lips, the
superior longitudinal is active for tongue-tip elevation (for the production of
/t/), and the genioglossus is a prime mover (protruding and bunching) of the
tongue. EMG data from these muscles were recorded on magnetic tape and
subsequently averaged using the Haskins Laboratories' EMG data system. The
basic procedure was to collect EMG data for a number of tokens of a given
utterance (in this experiment, between 10 and 15 repetitions), and to average
the integrated EMG signals at each electrode position.
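
The averaging step can be sketched as follows. This is only an illustration in
Python, not a description of the Haskins Laboratories' data system: the
function name, the sample-based window, and the toy values are all assumptions.
Each token is taken to be an already integrated EMG trace with a known sample
index for the line-up event.

```python
def average_tokens(tokens, lineup_indices, before, after):
    """Average several integrated-EMG tokens around a common line-up point.

    tokens          -- list of sample lists, one per repetition
    lineup_indices  -- sample index of the line-up event in each token
    before, after   -- samples kept on each side of the line-up point
    (Hypothetical helper; names and layout are illustrative assumptions.)
    """
    if len(tokens) != len(lineup_indices):
        raise ValueError("need one line-up index per token")
    windows = []
    for samples, zero in zip(tokens, lineup_indices):
        start, stop = zero - before, zero + after + 1
        if start < 0 or stop > len(samples):
            raise ValueError("token too short for the requested window")
        windows.append(samples[start:stop])
    n = len(windows)
    # Average sample by sample across the aligned windows.
    return [sum(column) / n for column in zip(*windows)]


# Two toy tokens whose peaks fall at different sample indices.
tokens = [[0, 2, 6, 2, 0], [2, 4, 2, 0, 0]]
print(average_tokens(tokens, lineup_indices=[2, 1], before=1, after=1))
# [2.0, 5.0, 2.0]
```

Aligning every token on the same speech event before averaging is what keeps
the peaks from smearing when repetitions differ slightly in timing.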

RESULTS 

The effect of speaking rate on the production of the labial CV syllables is
illustrated in Figure 1. This figure shows the averaged EMG curves of the
orbicularis oris and genioglossus muscles for Subject FSC during the production
of the utterance /ipip/ at both normal and fast speaking rates. The orbicularis
oris curves, as shown here, are associated with lip closure for the consonants,
while the genioglossus curves are associated with tongue movements for the
vowels. "0" on the time axis represents the time of offset of voicing of the
first vowel.

This figure shows that for fast speech, orbicularis oris activity increases,
while genioglossus muscle activity decreases. These changes, which coincide with
our earlier findings, are consistent and occur for each subject and all
utterances.

Although the existence of these effects is consistent, the magnitude of the
differences varies considerably, and the data do not show any clear
preconsonantal or postconsonantal vowel effects. We suspect that these
inconsistencies are caused, at least in part, by trade-offs between lip closing
and jaw closing.

The decrease in genioglossus muscle activity for vowel production also
occurs for each subject and for all utterances. These decreases imply that the
undershoot observed for vowels during faster speech is programmed into the
gesture and is not the result of a too fast succession of motor commands.

Our data also show that the lip rounding component of /u/ is likewise
produced with greater levels of muscle activity during faster speech. This is
illustrated in Figure 2, which shows orbicularis oris activity for the utterance
/utap/ for Subject FSC.

Figure 3 illustrates the effect of speaking rate on the production of the
lingual CV syllables. This figure shows the averaged EMG curves for the
superior longitudinal and genioglossus muscles for the utterance /itip/, this
time for Subject TG. These data show the same changes in muscle activity levels
as the labial consonant data: an increase in speaking rate is accompanied by an
increase in activity level for the consonant (superior longitudinal) and a
decrease in activity level for the vowel (genioglossus). Again the same results
occur for both subjects and all utterances, and the data do not show any
consistent vowel effects. The different effects of an increase in speaking rate
are especially interesting in this set of data because they demonstrate that
different motor reorganization strategies can be used for different muscles of
the same articulator.

The results of this experiment can be summarized as follows. For lip
movement associated with either labial consonant production or rounding for a
vowel, and for tongue-tip movement associated with lingual consonant
production, an increase in speaking rate is accompanied by an increase in the
activity level of the muscle. For tongue movement during vowel production, an
increase in speaking rate has the opposite effect: a decrease in the activity
level of the muscle. The first finding implies an increase in articulatory
effort and an increase in the speed of articulatory movement, while the second
finding implies a decrease in articulatory effort, combined with a decrease in
the speed of articulatory movement and/or a decrease in articulatory
displacement.

DISCUSSION 

The results for both the orbicularis oris and the superior longitudinal
muscles can be explained quite readily: the production of both /p/ and /t/
requires a complete occlusion of the vocal tract; thus, under the constraints of
an increase in speaking rate, the articulators move faster and with greater
effort to produce that occlusion.

The data for the tongue, however, cannot be explained so straightforwardly.
Obviously, the reduction in EMG activity for the vowel during faster speech is
not compatible with an "extra effort" or even a "timing only" (equal-effort)
control mechanism. Rather, the decrease in articulatory displacement usually
associated with fast speech is built into the planning of the gesture.

Our findings for vowel production also argue against the notion that a
vowel target is internalized as a set of invariant spatial coordinates. If a
vowel is organized in terms of an articulatory coordinate system, the system
must be a multiple coordinate one, or one characterized by an articulatory
field. Another view, however, and one that might better explain our data, would
be that a vowel is internalized as a set of acoustic targets, and that the
speech production mechanism uses any of a number of strategies to produce the
required acoustic result. This view would also explain the differences in the
tongue and lip data for /u/ during faster speech, i.e., a greater degree of lip
rounding serves to compensate for the decrease in articulatory displacement of
the tongue.

Jaw Movements During Speech: A Cinefluorographic Investigation*

T. Gay+

Haskins Laboratories, New Haven, Conn.



Jaw movement has been the subject of considerable interest in recent
physiological speech research from two points of view: first, the jaw has a
reputation of being largely responsible for many coarticulatory phenomena; and
second, the jaw has been shown to be involved in the control of tongue height
for certain vowels. The purpose of this paper is to examine in some detail the
movements of the jaw during the production of speech that varies systematically
in both phonetic context and speaking rate. The data reported here are part of
a larger cinefluorographic study on the dynamics of both tongue and jaw
movements during speech (Gay, in press).

Subjects were two adult males, FSC and TG, both native speakers of American
English. The speech material consisted of the consonants /p/, /t/, and /k/ and
the vowels /i/, /a/, and /u/ in a trisyllable nonsense word of the form
/kipipa/, /kipapa/, /kipupa/, etc. The three consonants and vowels were thus
arranged in all possible vowel-consonant-vowel (VCV) combinations. Each
utterance was preceded by the carrier phrase "It's a...," and was produced at
two speaking rates: normal and fast. Each rate was based on the subject's own
appraisal of comfortable normal and fast rates. The subjects were instructed to
speak the first two syllables of the utterance with equal stress, and the final
syllable unstressed.

Lateral-view X-ray films were recorded with a 16 mm motion picture camera
at a speed of 64 fps. The X-ray generator delivered 1 msec pulses to a 9 in
image intensifier tube. The X-ray films were analyzed frame-by-frame, using a
specialized film analyzer. The film was projected life size onto a writing
surface using an overhead mirror system. Jaw movement was tracked in the
vertical plane by measuring the vertical distance between the upper and lower
central incisors. Measurements were made from the time of /k/ release to the
time of closure for the final /p/.
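
The frame-by-frame measurement can be pictured with a short sketch in Python.
The film speed of 64 fps is taken from the text; the function names, the
choice of reference frame, and the incisor coordinates are hypothetical and
serve only to show how a frame index maps onto a time axis and how an opening
value is derived from two incisor positions.

```python
FRAMES_PER_SECOND = 64  # film speed given in the text


def jaw_opening_mm(upper_incisor_y, lower_incisor_y):
    """Vertical distance between upper and lower central incisors (mm)."""
    return abs(upper_incisor_y - lower_incisor_y)


def frame_times_ms(n_frames, zero_frame):
    """Time of each frame in msec, with 0 at the chosen reference frame."""
    ms_per_frame = 1000.0 / FRAMES_PER_SECOND  # 15.625 msec per frame
    return [(i - zero_frame) * ms_per_frame for i in range(n_frames)]


# Hypothetical vertical incisor positions (mm) for four consecutive frames.
upper = [20.0, 20.0, 20.0, 20.0]
lower = [12.0, 14.5, 16.0, 13.0]
openings = [jaw_opening_mm(u, l) for u, l in zip(upper, lower)]
times = frame_times_ms(len(openings), zero_frame=2)
for t, d in zip(times, openings):
    print(f"{t:8.3f} msec  {d:4.1f} mm")
```

At 64 fps successive measurements are 15.625 msec apart, which sets the
temporal resolution of any displacement curve drawn from the films.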

The data of this experiment will be presented in the following order: jaw
movement for the consonant, jaw movement for the vowel, and the effect of an
increase in speaking rate on jaw movements for the entire utterance.



*Text of a paper presented at the Eighth International Congress on Acoustics,
 London, July 1974.

+Also University of Connecticut Health Center, Farmington.

[HASKINS LABORATORIES: Status Report on Speech Research SR-39/40 (1974)]

Figure 1 shows the jaw displacement measurements, in mm, for the utterances
/apa/, /ata/, and /aka/ for both subjects. "0" on the abscissa represents the
time of closure for the consonant. The data in these graphs summarize the
extent of differences in jaw movement for the three consonants. First, for both
subjects, jaw closing is greatest for /t/ and least for /p/. The magnitude of
these differences, however, varies considerably for the two subjects. Whereas
the range (at time 0) from /t/ to /p/ spans approximately 8 mm for FSC, the
range is only 3 mm for TG. It should be noted, moreover, that for both subjects,
the range of displacement differences decreases when the vowel is either /i/ or
/u/, while /a/ shows by far the greatest effects.

Jaw displacement for both /t/ and /k/ is relatively insensitive to changes
in the preceding and following vowels. However, jaw displacement for /p/ shows
both anticipatory and carryover coarticulation effects of the adjacent vowels;
that is, jaw displacement for /p/ is greater when the preceding or following
vowel is open (Sussman, MacNeilage, and Hanson, 1973). These differences are as
great as 3 mm.

Figure 1 also shows individual differences in consonant effects on jaw
displacement for the adjacent vowels. Subject FSC shows rather large differences
(8-9 mm) in degree of displacement for the vowel, while TG shows essentially
none. Again the differences here for FSC are considerably less for /i/ and
virtually absent for /u/, and /a/ is the only vowel that shows consistent
coarticulation effects.

Apparently, jaw displacement for the open vowel is greatest when the
consonant is /p/ because the jaw is least involved in the production of this
consonant. Both /t/ and /k/, on the other hand, are characterized by greater
degrees of jaw closing. This probably acts to constrain the degree of opening
for the following vowels.

The differences in jaw displacement for /a/ as illustrated for FSC are
clearly related to differences in tongue height for the vowel. Figure 2 shows
the same jaw measurements as before, this time plotted along with measurements
of tongue height. The tongue height measurements represent the relative
positions of a 2.5 mm lead pellet attached to the surface of the tongue at a
distance of approximately 1 1/2 in from the tip. The measurements were plotted
from a fixed coordinate system that used various anatomical features as
landmarks.

While both sets of data for TG show identical movement patterns, the tongue
measurements for FSC clearly shadow those for the jaw. Indeed, the numerical
differences for the three vowel curves are similar for the graphs of both the
tongue height and the jaw opening.

We now consider how jaw displacement is affected by the first and second
vowels. Figure 3 illustrates the effect across the consonant of the second
vowel on the displacement of the jaw for the first vowel. Both the first vowel
and consonant are the same; only the second vowel is different. For both
subjects, jaw displacement for the first vowel is not at all sensitive to
changes in the second vowel (differences in maximum displacement never exceed
1 to 2 mm). The absence of any consistent anticipatory coarticulation effects
holds up as well for all the other consonant and vowel sequences. Although jaw
displacement for the first vowel in the VCV utterance is not sensitive to any
right-to-left, or anticipatory, effects beyond the consonant, jaw displacement
for the second vowel is subject to some left-to-right, or carryover, vowel
effects; these effects, however, are fairly complicated and are linked to the
consonant.

When the consonant is /p/, the first vowel has no real effect on jaw
displacement for the second vowel. Jaw displacement for all three vowels remains
quite stable. However, when the consonant is either /t/ or /k/, carryover
effects appear. Although jaw displacement for /i/ and /u/ in the second vowel
position remains stable, the first vowel exerts a strong influence on the
displacement of the jaw for /a/, this time for both subjects. This is
illustrated for the /t/ series of utterances in Figure 4.

This figure shows the jaw displacement curves for the utterances /ita/,
/ata/, and /uta/ for both subjects. These graphs show that there is less jaw
opening for /a/ in the second position when the first vowel is /a/ than when
the first vowel is either /i/ or /u/. At first glance these effects are
somewhat surprising. It would seem more likely, at least intuitively, that
greater degrees of opening for the second vowel would be associated with a more
open, rather than a more close first vowel. However, closer inspection of these
graphs can explain the effects. At about the time of closure for the consonant
(0 on the abscissa), the jaw is in approximately the same position for each of
the three different first vowels. At that point, however, the jaw is closing
toward minimum opening from /a/ while it is already beginning to open for the
second vowel from both /i/ and /u/. Thus, the jaw is moving in different
directions at that point, and, in effect, has a head start towards the second
vowel when the first vowel is close.

Figure 4 illustrates another aspect of jaw opening for the vowels: /u/ is
the only vowel characterized by a closed jaw position. Both /a/ and /i/ are
characterized by a more open jaw position: /a/ for obvious reasons, and /i/
probably to make room for the bunching of the tongue. This finding argues
against an articulatory feature system that includes both the tongue and jaw in
the description "close." Apparently the tongue and jaw can move independently
and even in different directions at the same time.

The differences in jaw opening for the open vowel are again related to
differences in tongue height for that vowel. This is illustrated in Figure 5,
which shows the previous jaw measurements plotted along with the measurements
for tongue height. As can be readily seen, the two sets of data closely follow
each other for /a/. The extent to which jaw opening controls tongue height for
the open vowel can be seen even more clearly in Figure 6.

This figure shows the two sets of measurements of the previous figure
plotted against each other, that is, the jaw opening measurements were
subtracted from the tongue height measurements to obtain the net movement
curves for the tongue. These data show two things: first, that the three tongue
curves follow essentially the same paths, clearly demonstrating that the
differences in tongue height were related solely to differences in jaw opening,
and second, displacement for the open vowel /a/ is controlled primarily by the
jaw. This is shown by the relative flatness of the curves once the tongue moves
down to the floor of the mouth from the preceding phone. These data, of course,
support the model proposed by Lindblom and Sundberg (1971).
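
The subtraction described above can be written out as a small Python sketch.
The function name and the sample values are hypothetical, and the sketch
assumes that tongue height and jaw opening are expressed on the same frames,
in the same units, and with the same sign convention, which the text does not
spell out.

```python
def net_tongue_movement(tongue_height_mm, jaw_opening_mm):
    """Tongue movement with the jaw's contribution removed, frame by frame.

    Both inputs are per-frame measurements in mm; the jaw opening series is
    subtracted from the tongue height series (hypothetical helper).
    """
    if len(tongue_height_mm) != len(jaw_opening_mm):
        raise ValueError("series must cover the same frames")
    return [t - j for t, j in zip(tongue_height_mm, jaw_opening_mm)]


# Invented values: the tongue pellet and the jaw move in parallel, so the
# net tongue curve comes out flat.
tongue = [10.0, 8.0, 6.0, 6.0]
jaw = [4.0, 2.0, 0.0, 0.0]
print(net_tongue_movement(tongue, jaw))  # [6.0, 6.0, 6.0, 6.0]
```

A flat net curve of this kind is what would be expected when the measured
change in tongue height is carried entirely by the jaw.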



Note, however, the differences in tongue and jaw movement for /i/: each seems
to move independently of the other.

The effect of an increase in speaking rate on the movement of the jaw is
illustrated in Figure 7, which shows the jaw displacement measures of the
utterances /api/, /apa/, and /apu/ for both the slow and fast speaking rates for
FSC. Generally speaking, an increase in speaking rate results in a decrease in
jaw displacement for the entire utterance. In other words, jaw movement during
fast speech mirrors jaw movement during slow speech but from a more closed
position. Also, the context effects that appeared for /a/ during slow speech
were generally absent during fast speech. This is probably due to the more
restricted path of movement of the jaw during fast speech.

The results of this experiment can be summarized as follows.

Jaw movement towards and jaw position for a stop consonant are different
depending upon the consonant. However, the magnitude of these differences varies
with the subject. Indeed, the individual differences are striking. Vowel
effects on jaw displacement for the consonant occur only for /p/, apparently
because the jaw is least involved in the production of /p/; /t/ and /k/ are both
characterized by a more closed jaw position and are probably more resistant to
coarticulatory influences.

Although a considerable amount of time was spent describing and
illustrating the variability of jaw displacement for vowels, such variability
is really the exception rather than the rule. Variability of jaw displacement
occurred only for /a/, and then only for certain contexts. Jaw movement and
displacement for /i/ and /u/ were quite stable in all environments. This would
indicate that the acoustic properties of /a/ are either more insensitive to
articulatory variability than those of /i/ and /u/ or that the acoustic field of
/a/ is somewhat larger. The data of this study also support the model proposed
by Lindblom and Sundberg (1971), that tongue height for an open vowel is
controlled by the degree of jaw opening for that vowel, but the data argue
against a feature system that proposes a one-to-one relationship between tongue
height and jaw opening for all vowels.

The effect of an increase in speaking rate was reflected by an overall de- 
crease in jaw displacement for the vowel, a decrease that absorbed virtually all 
the context-dependent coarticulation effects present during slow speech. 

Finally, the major conclusions that can be drawn from the data of this
experiment are that the jaw is, indeed, sensitive to certain anticipatory and
carryover coarticulation effects, but only to a limited degree; individual
differences in jaw movement are real and often large; and the jaw is, in a real
sense, a primary articulator, controlling tongue height for an open vowel.

A Preliminary Electromyographic Study of Labial and Laryngeal Muscles in Danish
Stop Consonant Production

Eli Fischer-Jørgensen and Hajime Hirose



INTRODUCTION 

Danish has six stop consonants: p, t, k, b, d, g. Normally, ptk are
distinguished from bdg only in syllable-initial position followed by a full
vowel (with or without intervening l or r), e.g., pile/bile ['pʰi:lə, 'bi:lə],
klat/glat [klad, glad], betale/pedal [be'tʰa:ʔlə, pʰe'da:ʔl]. Medially before
schwa only [bdg] are found (irrespective of the orthography), and in final
position [ptk] and [bdg] vary freely.¹

In the positions where ptk and bdg are phonologically distinct, ptk are
strongly aspirated with a voice-onset time (VOT) value of 60-80 msec in more
old-fashioned speech (in modern Copenhagen speech somewhat more), and t is
moreover affricated, whereas bdg are unaspirated (see Fischer-Jørgensen, 1954).
Both categories are voiceless, and there is no evidence that ptk should be more
tense than bdg in the narrower sense of having stronger articulatory activity.
There is rather some evidence to the contrary: that bdg have a longer closure
than ptk and a tendency to higher mechanical pressure at the place of
articulation. The difference in duration of the closure is small (normally
20-30 msec for b/p and 40-50 msec for d/t), but it is stable and statistically
significant (Fischer-Jørgensen, 1954; Frøkjær-Jensen, Ludvigsen, and Rischel,
1971). The mechanical pressure is rather variable, and the difference is
significant only for some subjects. A questionnaire concerning the kinesthetic
judgment of effort in stop consonants has shown that 67 percent of the speakers
felt the organic pressure to be greater in bdg and only 22 percent found it to
be greater in ptk, whereas 10 percent did not feel any difference
(Fischer-Jørgensen, 1972). The intraoral air pressure of ptk is about 5 percent
higher than that of bdg, but the difference is found only at the end of the
closure. These relations between ptk and bdg differ from what is generally
thought to be characteristic of the two categories of stops.² It therefore
seemed interesting to undertake an electromyographic (EMG) investigation of the
labial muscles for p and b.

here. It should, however, be emphasized that we are not describing the
phonetic difference between the phonemes ptk and bdg, but the difference between
the "microphonemes" in syllable-initial position.

Since the main difference between Danish ptk and bdg is one of aspiration,
the condition of the glottis should be of primary importance. A glottographic
examination has shown that Danish ptk have a wide-open glottis with the maximum
aperture close to the point of release, whereas bdg have a smaller aperture with
the maximum near the beginning of the closure and practically zero aperture at
the release (see Frøkjær-Jensen, 1967, 1968; and particularly, Frøkjær-Jensen
et al., 1971). These findings seem to be confirmed by a preliminary fiberoptic
investigation undertaken by Jørgen Rischel. Whereas the wide aperture of the
glottis in ptk must be due to the neural command, the smaller aperture in bdg
might (according to a hypothesis advanced by Frøkjær-Jensen et al., 1971) be
due to aerodynamic forces only. In order to test this hypothesis, we made an
EMG investigation of the laryngeal muscles.

EXPERIMENTAL METHOD 

Electrode Insertion Technique

In the present study, hooked-wire electrodes were used. Insertion into the
labial muscles was made approximately at the positions described by Leanderson
(1972). During electrode insertion an oscilloscope and an amplifier system were
used for monitoring the pertinent muscle activity. Insertion into the
orbicularis oris superior (OOS) and inferior (OOI) was made at the vermillion
border of the upper lip and the lower lip about 1 cm laterally to the midline.
If EMG activity was found in OOS or OOI for protrusion of the lips or for
production of labial stops, the placement was considered to be correct.

The depressor anguli oris (DAO) was reached at the point 1.5-2 cm below the
angle of the mouth. Insertion into the depressor labii inferior (DLI) was made
about 2-2.5 cm below the point of insertion into OOI. To verify the correct
placement for these muscles, the subject was asked to pull the angle of his
mouth or his lower lip downwards. One cannot always be certain in
differentiating between DAO and DLI since there is a possible overlapping of
the fibers of these muscles. Anatomically, however, DAO is known to be
distributed more superficially than DLI. Therefore, the depth of insertion was
controlled to avoid interference.

In order to reach the mentalis (MENT), the needle was inserted in the
midline deeply enough to feel the bony surface of the mandible. The placement
was considered to be correct if EMG activity occurred when the subject attempted
to move the skin over the chin upwards. The technique of insertion into the
laryngeal muscles was essentially the same as that described by Hirose (1971a;
Hirose, Gay, and Strome, 1971). The interarytenoid (INT) and the posterior
cricoarytenoid (PCA) were reached perorally, using an L-shaped probe under
indirect laryngoscopy. A percutaneous approach was employed for insertion into
the thyroarytenoid (VOC), the lateral cricoarytenoid (LCA), and the
cricothyroid (CT).

²Weaker mechanical pressure in aspirated stops was, however, found as early as
 1897 by Rousselot (p. 596), and the same was found for Gujarati
 (Fischer-Jørgensen, 1968a:96). The measurements of organic pressure and
 intraoral air pressure in Danish stops have not been published in detail
 except for a bilingual subject (Fischer-Jørgensen, 1968b).

Data Recording and Processing 

EMG signals were recorded on a multichannel data recorder simultaneously
with acoustic signals and automatic timing markers. The signals were then
reproduced and fed into a computer after appropriate rectification and integration.
The EMG signal from each electrode pair was averaged for each utterance
type with reference to a line-up point on the time axis representing a
predetermined speech event. In this experiment the line-up point was always at the end
of the frame ([han sa:] "he said"), i.e., at the implosion of the following
consonant. The data-recording and computer-processing systems used in the present
experiment are described in more detail by Port (1971) and Kewley-Port (1973).
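The processing chain described above (rectification, integration, and averaging of tokens aligned at a line-up point) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Haskins system documented by Port (1971) and Kewley-Port (1973): the moving-average smoother stands in for the unspecified integration step, and the function names, the window length, and the sample-index representation of the line-up point are all hypothetical.

```python
from statistics import fmean


def rectify(signal):
    """Full-wave rectification: take the absolute value of every sample."""
    return [abs(s) for s in signal]


def integrate(signal, window):
    """Smooth a rectified signal with a trailing moving average.

    Assumption: a moving average of `window` samples stands in for the
    integration step, whose time constant the text does not give.
    """
    out = []
    running = 0.0
    for i, s in enumerate(signal):
        running += s
        if i >= window:
            running -= signal[i - window]
        out.append(running / min(i + 1, window))
    return out


def average_at_lineup(tokens, lineups, before, after, window=3):
    """Average processed tokens over a span around each token's line-up sample.

    tokens  -- raw EMG traces, one list of samples per utterance token
    lineups -- sample index of the line-up point in each token
    before, after -- number of samples kept before / from the line-up point
    """
    segments = []
    for raw, lineup in zip(tokens, lineups):
        if lineup - before < 0 or lineup + after > len(raw):
            continue  # token does not cover the requested span; skip it
        processed = integrate(rectify(raw), window)
        segments.append(processed[lineup - before:lineup + after])
    if not segments:
        raise ValueError("no token covers the requested span")
    # Sample-by-sample mean across the aligned tokens.
    return [fmean(column) for column in zip(*segments)]
```

Aligning every token at the same speech event before averaging is what makes the peaks of the averaged curves interpretable relative to the implosion; tokens rejected for hesitations or artifacts would simply be left out of `tokens`.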

Subjects and Material 

Recordings were made of six subjects: PM, EG, EFJ, TB, PH, and HA, who all 
speak Standard Danish, although with slight local differences. PM, PH, and TB 
are from Copenhagen, EG from Jutland. EFJ (one of the present authors) grew up 
in Funen, but has never spoken Funish dialect. HA is from Copenhagen, but has 
been in America since the age of 18; his Danish seems unspoiled. TB and HA have
relatively long aspirations of ptk. The age of the subjects (in 1972) was 24-39
years, except for EFJ (61).

Not all subjects were able to tolerate peroral insertion into the laryngeal 
muscles or too many insertions in the labial area, and the material is therefore 
rather heterogeneous. Table 1 gives a survey of the muscles examined for the 
different subjects. Parentheses indicate that the interpretation of the curves 
is dubious. TB has much noise in his OOI, and PH's OOS was completely unusable
because of noise and has been left out altogether.

The linguistic material used consisted of real Danish words spoken in the
frame han sagde [han sa:] "he said." The words are listed in Table 2 in
systematic groupings. The initials of the subjects who spoke the words in question
are given in parentheses for each section of the list. The words in 1a and 2a
were spoken by all subjects. The words in 1b were not spoken by HA and the
words in 2b were not spoken by EG and EFJ. Group 3 was intended for an examination
of the Danish 'stød' (see Fischer-Jørgensen and Hirose, 1974), but as it
contains some words with initial p and m, it is also partly relevant for the
examination of the consonants. List 3a was read by PM, TB, PH, and EFJ; 3b only
by EFJ.

The words were presented in a list containing four different randomizations. 
The list was read four times by each subject. There are thus 16 examples of 
each word (but some had to be eliminated because of hesitations, artifacts, 
measurement error in editing the raw data, etc., at the time of computer process- 
ing). In the list read by EFJ 2a occurred only once, so that there were only 
four examples of each word. The consonant list was, however, spoken twice by 
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TABLE 1: Recorded muscles. 

Subjects: PM EG EFJ TB PH HA 

I II 

Labial 

muscles: OOS x x x x x x 

OOI XXX X (x)

DLI XXX XXX 

DAO X X 

MENT XXX X 

mid DLI-DAO x x 

Laryngeal 

muscles: PCA x 

INT X X 

VOC X XX 

LCA X 
TABLE 2: Words read by the subjects. (The words
are given in phonetic transcription.)

1a. 'pʰanə 'banə 'pʰi:lə 'bi:lə 'pʰu:ə 'bu:ə pʰe'da:?l be'tʰa:?lə
(PM, EG, EFJ, TB, PH, HA)

1b. pʰa'gai? ba'kʰan?t pʰu'rist bu'dist (PM, EG, EFJ, PH, TB)

2a. 'tʰanə 'danə 'kʰalə 'galə (PM, EG, EFJ, PH, TB, HA)

2b. 'tʰi:ə 'di:ə 'kʰilə 'gilə tʰu:ə du:ə kʰu:lə gu:lə
(PM, TB, PH, HA)

2c. salə falə (HA)

3a. •le:sB '1e:?sB •pH:bM 'p^i:7btf man rmnt (PM, TB, PH, EFJ) 

3b. 'lestt 'p^ibw ma:?n man?n •k^e:l3 •k^e:?ltf k\l\i 'k^elTfcf •hu:an 
•hu:7an (EFJ) ' 

EFJ in the same session. It was also spoken twice by PM, in two different 
sessions called I and II. 

RESULTS AND DISCUSSION 

Labial Muscles 

General activity of labial muscles for labial stops. First we may ask:
(a) which of the investigated muscles were found to be active for labial stops,
(b) whether they are active for the closing or the opening movement, and (c)
whether they are active for other sounds found in the material as well. There
are some individual differences on this point, which may in some cases be due
to the placement of the electrodes.

The results are summarized in Table 3. The symbol + indicates activity of
the muscle, 0 means "no activity." Capital letters indicate subjects and
parentheses around a letter indicate that the activity is weak for this subject. The
letters a, i, u indicate the vowels following the stop consonant (i includes the
e of [pʰe'da:?l] and [be'tʰa:?lə]). Vowels are given only in the cases where
the activity is different before different vowels.

The activity for the closing movement has its peak almost at the line-up
point, i.e., where the vowel [a:] of the frame ends and the implosion of the
consonant takes place. It starts around 100 msec earlier and ends around 100-200
msec later than the line-up, but, particularly for OOS, the peak is generally
very pronounced. OOS is active at the implosion for all subjects, OOI for
all but one (PH, who, however, has a small peak in the preceding vowel). One of
the subjects (EG) also has a definite activity of DLI; two others (PM and HA)
have a small peak for DLI at the implosion. Both of the subjects for whom DAO
was recorded show a definite peak at the implosion. The same is true for an
electrode point between DAO and DLI (mid DLI-DAO in Table 1) which is not
included in Table 3. MENT is clearly active for one of the three subjects and
has a weak activity for another. This means that for the closing movement
there is strong activity in OOS and OOI (with the exception of subject PH for
OOI), activity in DAO for both subjects, and clear activity for one of three
subjects in MENT.

The activity preceding the release (approximately 100 msec later) is quite
different. Practically nobody has any activity in OOS (except for EFJ, who has a
small peak before i), whereas all have activity in OOI and DLI but with a
characteristic distribution depending on the following vowel (see Table 4). All have
activity in OOI before u and in DLI before a, i; EG also has activity of OOI



TABLE 4: Activity of OOI and DLI at the release.
Visual inspection of the lips also reveals that they are pressed somewhat forward
at the release of pu, whereas the lower lip goes down in pa.
before a and i, and EFJ before (but weaker). Only PH has no clear difference
between the two muscles (except that DLI is slightly weaker before u).

This division of labor is confirmed by an examination of the activity of
the two muscles at the release of consonants other than the labial ones. All
subjects have activity in OOI before and during the vowel u, and all have activity
in DLI when the consonant is followed by a or i. The activity of DLI is of
somewhat shorter duration than the activity of OOI for u, but the peak is generally
not quite as sharp as that of OOS for the closing movement. All subjects
have a stronger activity of DLI after p than after t and k, and most of them
(but not PM) have a tendency to stronger activity before a than before i; this
means that the activity is influenced by the size of the opening movement
required for the vowel. We have thus not found a definite activity of a definite
muscle intended to open the lips in labial consonants. The command seems to aim
at the following vowel, and the muscle that is appropriate for producing the
vowel is activated. This does not, however, exclude the possibility that an
investigation of other facial muscles might disclose an activity aiming
specifically at the release of a labial closure.

The difference between DLI and OOI according to the following vowel also
shows up in the vowel itself. It is simply the same thing. All use OOI for
rounding. Even PH, who did not distinguish OOI and DLI after labial consonants,
has a clear difference after tdkg. TB's OOI curve is rather noisy and difficult
to interpret; he has more activity in u than in a and i after tdkg (and some
activity after in all cases), but after labial consonants there is no clear
difference. Some subjects (EG and PM) also use OOS for rounding, i.e., they
have a longer duration of the OOS activity for pu than for pa, but PM has hardly
any activity of OOS for rounded vowels after t and k. DAO and DLI show no
activity at all for rounding in the present series of experiments. The dominating
muscle for rounding seems to be OOI.

The MENT is less interesting than the other muscles because the three subjects
use it differently (if it is at all the same muscle that is recorded).
EFJ has a peak somewhat after the implosion of the consonant, and this activity
is much more pronounced before u (she also has a peak in the word huen). PM has
a peak at the closing movement.

The activity of PM's labial muscles (OOS, OOI, DLI, and MENT) was recorded
in two different sessions. The general pattern is very much the same, but the
activity of MENT was much higher the second time (in relation to other muscles),
and the relative peaks of OOS and OOI at the implosion were directly reversed.
In the first session the mv value of the peak of OOS was much higher than that
of OOI; in the second session it was much lower. This demonstrates that the
degree of activity of two different muscles cannot be compared, since the
electrodes may have been placed more or less close to the active motor units (in the
illustrations no attempt has been made, therefore, to use the same mv scales for
all muscles).

Figures 1-5 contain some characteristic examples illustrating what has been
said on the preceding pages. Figure 1 shows averaged EMG curves of OOS, OOI,
and DLI for the words [pʰanə, pʰu:ə] and Figure 2 shows those for [tʰanə,
tʰu:ə] spoken by subject PM. It is evident that OOS and OOI have peaks at the
implosion of p and that OOI has a second peak at the release of p before u
(continuing through the vowel), whereas DLI has a second peak at the release of
p before a. Similarly OOI (and partly OOS) are active for u after t, whereas
DLI is active for a after t. The DLI peak before the line-up point pertains to
the start of s in [sa:] in the frame.

Figure 3 shows much the same pattern of OOS and OOI as Figure 1 for
[pʰanə] and [pʰu:ə] pronounced by EFJ (the peak before the line-up point is due
to a certain rounding of s), while Figure 4 shows the pattern of OOS and DLI for
[pʰanə] and [pʰu:ə] pronounced by TB.

Figure 5 shows that EG differs from the other subjects in using OOS almost
to the same extent as OOI for rounding, in having a high peak of DLI at the
implosion besides at the release before unrounded a, and in having a second peak
of OOI not only before rounded vowels, but also before unrounded vowels (although
somewhat lower).

The material comprises both stressed and unstressed p and b, e.g., ['pʰanə,
pʰa'gai?] and ['banə, ba'kʰan?t]. There is no difference in lip activity for
the two stress conditions (the [b] of [^p**ibtf] and ['p**i:?b^i] has, however, less
activity than the initial [pʰ] in OOS and partly in OOI for subjects PM, EFJ,
and TB).

The activity found in the labial muscles for Danish labial consonants is, in
general, in agreement with what has been found for other languages. The activity
of OOS at the implosion is, for example, similar to the activity found by Hirose
and Gay (1971) for English stops. Öhman, Leanderson, and Persson (1965, 1966)
consider DAO as a closing muscle and DLI as an opening muscle. This is confirmed
by the present investigation. They consider OOS and OOI as rounding muscles. A
similar argument is made by Hadding, Hirose, and Harris (in press) on Swedish
rounded vowels. This is confirmed for OOI, but OOS was used by only some subjects
in the present series, while both OOS and OOI function as closing muscles.

Differences between p and b. The general observations of muscle activity
are the more reliable results of the investigation, whereas we did not get a
clear answer to the problem that was the starting point: the difference between
p and b. There are both individual differences and inconsistencies for the same
speaker, and only a few of the differences between individual means of p and b
are statistically significant. But some tendencies are obvious.

The words which can be directly compared are those in Table 2 (1a and 1b).
All subjects read 1a, all but HA read 1b, and PM and EFJ read the list twice
(separate means are taken of these two readings). This gives twelve word-pairs
for PM and EFJ, six for EG, TB, and PH, and four for HA.
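The pair counts just given follow from simple bookkeeping, which the sketch below makes explicit. It assumes that list 1a contributes four p/b word pairs and list 1b two, as the word lists in Table 2 suggest; the dictionary layout and names are illustrative only.

```python
# Number of p/b word pairs in each list of Table 2 (assumed from the word lists).
PAIRS_PER_LIST = {"1a": 4, "1b": 2}

# Lists read by each subject, and the number of readings for which
# separate means were taken.
READINGS = {
    "PM": (("1a", "1b"), 2),
    "EFJ": (("1a", "1b"), 2),
    "EG": (("1a", "1b"), 1),
    "TB": (("1a", "1b"), 1),
    "PH": (("1a", "1b"), 1),
    "HA": (("1a",), 1),
}


def comparable_pairs(lists, readings):
    """Word pairs available for a p/b comparison for one subject."""
    return readings * sum(PAIRS_PER_LIST[name] for name in lists)


counts = {subject: comparable_pairs(*spec) for subject, spec in READINGS.items()}
```

With these assumptions `counts` gives twelve pairs for PM and EFJ, six for EG, TB, and PH, and four for HA, matching the figures in the text.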

As for the closing muscles, there are no examples for subject PH since his
OOS data are bad, and he does not use his OOI as a closing muscle. EFJ has a
clear tendency to more activity for b. In ten of twelve pairs b has a higher
maximum of activity in OOS, and in nine of twelve pairs in OOI. All exceptions
are in unstressed position. Only three pairs in stressed position have a
significant difference (3 percent level), but the number of pairs showing the same
difference cannot be accidental. Moreover, the activity for b is normally of a



There is no difference in the MENT, but its peak is later, between the normal 
place of closing and opening activity. 
slightly longer duration, which fits well with the longer closure time. But
none of the other subjects has a clear difference between p and b. The number
of individual means having higher activity for p or for b is about equal, and
the differences are small. This is true of HA (OOS), TB (OOS and OOI), EG (OOS
and DAO), and PM (OOS, OOI, DAO, and the first peak of DLI). There are only a few
deviations from this general pattern: EG and PM have more activity for b than
for p in MENT (for PM ten of twelve means, for EG all six means); and EG has
higher activity for p than for b in OOI (five of six means).

For the muscles that are active at the release (DLI and OOI) the picture is
somewhat different. EG has no clear difference between p and b. TB's OOI is so
noisy that it is difficult to see any clear peak (perhaps there is a tendency to
slightly higher p); his DLI shows a clearly higher maximum for b in three of
four pairs. The other subjects (PM, EFJ, PH, and HA) show a clear tendency to
have a higher maximum for b than for p. The comparable pairs are PH: OOI (six
pairs), and DLI (six pairs); EFJ: OOI (four pairs before u); HA: DLI (three
pairs before a, i); and PM: OOI (four pairs before u), and DLI (eight pairs
before a, i). Of these 31 pairs, 30 have a higher maximum for b than for p.
Although only six of the individual pairs have a significant difference, the
tendency is quite clear.
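One way to put a number on how unlikely such a lopsided count would be by chance is an exact sign test. The sketch below is an illustration only: the paper does not say that a sign test was used, and the test treats the pairs as independent, which pairs drawn from the same subject and muscle are not strictly. The function name is hypothetical.

```python
from math import comb


def sign_test_p(successes, n):
    """One-sided exact binomial probability of at least `successes`
    outcomes in one direction out of n pairs, if both directions
    were equally likely (p = 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# 30 of 31 release pairs with a higher maximum for b than for p.
p_release = sign_test_p(30, 31)

# 10 of 12 pairs with a higher OOS maximum for b (subject EFJ, closing muscles).
p_closing = sign_test_p(10, 12)
```

Under these assumptions `p_release` is about 1.5e-8 and `p_closing` about 0.019, which is consistent with the authors' reading that the direction of the difference is systematic even though few individual pairs reach significance.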

On the whole it must be said that the maxima are rather variable, but four
of the six subjects have a clear tendency to stronger activity for b than for p
at the release, and one also for the closing activity, whereas the others do not
show any clear difference in this case.

A difference in the relation between p and b at the implosion and at the
release has also been found by Öhman et al. (1965, 1966) for a Swedish subject:
the muscles which were active at the implosion showed a higher peak for p than
for b, whereas the muscles which were active at the release showed a higher peak
for b. Similarly Harris, Lysaught, and Schvey (1965) found a tendency to a
higher activity of orbicularis oris for p than for b in English at the implosion,
whereas no difference was found in the activity of DLI at the release. Finally,
Slis (1970) found that in Dutch the activity of orbicularis oris was significantly
higher for p than for b at the implosion, whereas the activity of DLI at the
release was only slightly higher for p than for b. Despite the differences
among the languages the relation between implosion and release is similar in all
cases, p being relatively stronger at the implosion than at the release.

As for the differences among the languages, we should remember that their
stops are not phonetically identical: Danish has aspirated p and voiceless b,
Swedish has aspirated p and voiced b, Dutch has unaspirated p and voiced b, and
English has aspirated p and a b which may be voiced or voiceless. If we use a
narrower phonetic transcription where [p] indicates unaspirated p, and voiceless
b is indicated by a small circle, the relations at the implosion can be stated
in the following way: [p > b] (Dutch, significant difference), [pʰ > b] (Swedish
and English, tendency), [b̥ (>) pʰ] (Danish). This is, as we would hope, in good
agreement with findings of mechanical lip pressure: [p > pʰ > b] (for Armenian:



This tendency has also been found for English by Lubker and Parris (1970) and
by Tatham and Morton (1973), whereas Fromkin (1966) found the opposite tendency
in initial position. They all measured the orbicularis oris. However, it is
not clear to what extent the b's of the informants were voiced.
Rousselot, 1897:599; and for Gujarati: Fischer-Jørgensen, 1968a:96), [p > b]
(for French, partly significant: Fischer-Jørgensen, 1968a:71), [pʰ > b] (for
English, tendency: Malécot, 1966; Lubker and Parris, 1970), [b̥ (>) pʰ] (Danish).
Moreover, the only Danish subject for whom both EMG activity and mechanical
pressure have been measured (EFJ) has [b̥ > pʰ] in both cases. Thus it seems as
if unaspirated p has the strongest lip activity, voiced b the weakest activity,
and aspirated [pʰ] is in between. Danish [b̥] is very similar to unaspirated p,
but is felt as somewhat weaker; it must therefore be very close to [pʰ]. This
description, though based on very restricted material and needing corroboration
by further investigations, is in good agreement with subjective impressions.

As for the difference between implosion and release, Öhman et al. (1966)
suggest that the higher EMG activity at the implosion of Swedish p is conditioned
by the higher air pressure, whereas at the release less muscle activity should
be needed for the opening movement of p because the air pressure works in the
same direction. This is an interesting hypothesis.

As far as the implosion (i.e., the activity of the orbicularis oris) is
concerned, it has often been assumed that a higher lip pressure is necessary for p
in order to maintain the closure against a high air pressure. In some languages
p has in fact both a higher mechanical lip pressure and a higher intraoral air
pressure than b, e.g., in French and Gujarati (Fischer-Jørgensen, 1968a:93 ff).
This is, on the whole, the case when b is voiced, since intraoral pressure is
intimately connected with voicing. The assumption of a connection between
intraoral pressure and lip pressure also finds partial support in the EMG values
of m before a found in the present investigation (subjects PM, TB, and EFJ).
For both OOI and OOS m has a lower peak than p and b, and in some cases the
difference is significant. The difference is, however, not nearly as large and
clear-cut as should be expected if the (very large) difference in air pressure
were the conditioning factor (cf. also the very small differences between m and
b found by Harris et al., 1965). Moreover, no clear difference in mechanical
pressure has been found between m and b or p in Danish. Various other facts
speak against a close connection between air pressure and lip pressure. First,
for some sounds there is an obvious lack of correlation: in Gujarati, for
instance, the aspirated labial stops [pʰ] and [bʰ] were found to have a higher air
pressure, but a lower mechanical lip pressure than their unaspirated cognates
[p] and [b]. Moreover, Tatham and Morton (1973) did not find any correlation
between air pressure and activity of the closing muscle OOS for English p and b
(except for the fact that the activity of the muscle had not gone quite as far
down at the release for p as for b, which can hardly be of any importance).
Finally, the air pressure curve generally has its maximum close to the release,
whereas the mechanical pressure has its maximum in the first half of the
consonant, followed by rapid descent. This means that the mechanical pressure in
the first part of the consonant is much higher than is necessary in order to
maintain the closure. Tatham and Morton (1973) are probably right in assuming
that this surplus of pressure permits a high degree of variability. There is
thus hardly any close connection between air pressure and lip pressure at the
implosion.

On the other hand, the facts about release may be interpreted in a way that
supports the hypothesis of Öhman et al. (1965, 1966): that the air pressure may
contribute to the opening of the lips and thus do some of the work for the
opening muscles. If, for the time being, we assume that the preliminary results
found for Gujarati (lip pressure: [p > pʰ > b], intraoral air pressure:
[pʰ > p > b]) will hold, then the argument may run as follows: In Swedish,
where there is a large difference in air pressure between [pʰ] and [b] (two
steps) and a relatively smaller difference in lip pressure (one step), the
difference in air pressure succeeds in reversing the relation between [pʰ] and [b]
so that we get [b > pʰ] for the activity of the opening muscles; in French,
where the difference in lip pressure (two steps) is relatively larger than the
difference in air pressure (one step), the effect of the air pressure is only
to diminish the difference of activity for the release of [p] and [b] ([p] >
[b]); in Danish, where there is no consistent difference in lip pressure, the
(very small) difference in air pressure is sufficient to diminish the
requirements on the activity of the opening muscles for [pʰ] so that we get the
tendency [b̥ > pʰ] (for subjects who have stronger activity in the closing muscles
for [b̥] the relation is kept for the opening muscles). The influence of the
air pressure also appears from the fact that the Danish subjects PM and TB show
more activity for the opening of m than for p and b, although m had less activity
in the closing muscles. (PH, however, has more activity of DLI for b than
for m.) The argument presupposes that the lip pressure for p does not decrease
at a faster rate than that for b, but in French, at any rate, it has been shown
that the lip pressure for p is still higher than that of b at 70 percent of the
distance from the implosion (Fischer-Jørgensen, 1968a:97). In any case, Öhman's
hypothesis deserves further testing.

Laryngeal Muscles 

One of the aims of the recording of the laryngeal muscles was to test the
hypothesis advanced by Frøkjær-Jensen et al. (1971) that there need not be any
activity in the opening muscle for bdg. For this purpose we especially wanted
to make recordings of the abductor muscle PCA and the adductor muscle INT.
Unfortunately, we were able to record from these two muscles for only one subject
(EFJ), and from just INT for subject HA. Examples of averaged curves are given
in Figures 6 and 7 (subject EFJ) and in Figure 8 (lower portion: subject HA).

Subject EFJ's INT curves are very clear. In the phrases [han sa: 'pʰanə]
and [han sa: 'banə] there is a large dip for s and p and a somewhat smaller dip
for b (Figure 6). Similarly in [pʰa'gai?] there is a larger dip for p and a
smaller dip for g, and in [ba'kʰan?t] a smaller dip for b and a larger dip for k
(Figure 7). On the other hand, there is a higher peak for the vowel after p
than after b. Moreover, the curves show a displacement to the right of both
consonant minimum and vowel maximum for words with ptk compared to those with
bdg. This is in good agreement with glottograms of the consonants in question
(Frøkjær-Jensen, 1967, 1968; Frøkjær-Jensen et al., 1971), which show that the
maximum aperture is found in the beginning of the lip closure for b, decreasing
to zero at the release, whereas the maximum aperture for p is close to the
release, decreasing during the following aspiration. The INT curves in Figures 6
and 7 are typical. There are no exceptions to the difference between the dips
for ptk and bdg, and this difference is evidently significant. As for the



The air pressure relation is corroborated by Nihalani (1974), but some further
recordings of Indian stops by one of the present authors (EFJ) do not show a
clear difference between [pʰ] and [p].

For English the reasoning depends on the degree of voicing of b.
difference in vowel peaks after the stop consonants there are only two exceptions
(out of 20 pairs). Although the difference between individual pairs is significant
in only three cases, the general tendency is quite obvious. This means
that the activity of the closing muscle is more pronounced when the preceding
consonant has a larger glottis opening, i.e., the motor command and the resulting
muscle activity depend on the preceding state of the vocal tract.

Subject HA's INT curves look somewhat different (Figure 8, lower portion).
He shows no difference between the dips for ptk and bdg, but the difference in
the following vowel is very clear and stable. The vowel has a higher peak after
ptk than after bdg in all 11 pairs, and the difference is statistically significant
in four individual pairs. The increase in activity for INT is thus considerable
for both subjects. It is also common to the two subjects that both the
minimum in the consonant and the maximum of the following vowel are delayed in
ptk words in comparison to bdg words. But there are some differences in the
timing in relation to the line-up point and the absolute differences between ptk
and bdg, which can be seen in Table 4. Subject HA starts the relaxation of INT



TABLE 4: Timing in INT for stressed ptk and bdg, in
msec, in relation to the line-up point.
earlier than EFJ and reaches the minimum value earlier, but his vowel peak is,
nevertheless, later after ptk.

The long distance from valley to peak for ptk reflects a considerably
longer aspiration in the case of HA (his aspiration in [pʰanə] is, e.g., 128 msec
whereas EFJ has an aspiration of 66 msec). But it is not quite clear why HA
should start his relaxation of INT earlier for this purpose. For the labial
muscles, his timing does not differ from that of EFJ. Air-stream curves might
show whether the vowel preceding the consonant is breathy. His long aspiration
and the difference in vowel peak after ptk and bdg point to a wide-open glottis
in ptk, which must be produced by the abductor muscle only, since the relaxation
of INT is the same for ptk and bdg, but unfortunately we did not get any curve
of his PCA.

The recording of PCA for subject EFJ shows a small peak about 20 msec after
the line-up point for ptk, and a slightly smaller peak at the line-up point for
bdg. The peak of ptk is higher in 14 out of 16 pairs, but the difference is
often small. The increase in activity starts around 50 msec before the implosion
and goes rather quickly down again after the maximum (see Figures 6 and 7).
The activity lasts somewhat longer for ptk than for bdg. There is no peak for
m, but there is, unexpectedly, a clear peak for l (in ['l£Sb» *lc:sW| *1c:7sh]),
for which INT does not show any dip. In any case there is evidence for an
active opening movement in bdg for this subject.
For subjects EFJ and HA no other laryngeal muscles have been recorded, but
we obtained recordings of LCA for PM and of VOC for PM, TB, and PH (PH's VOC is,
however, not reliable). As shown in Figure 9, PM has a small dip in LCA for ptk
with a minimum about 65 msec after the implosion, but no dip for bdg. He has a
similar dip for ptk in VOC with a minimum about 83 msec after the implosion.
The VOC curve also shows a very small dip for bdg slightly later (about 95 msec
after the implosion). No dips occur for m and l. TB has a somewhat deeper minimum
in VOC for ptk, about 75 msec after the implosion, as shown in Figure 8
(upper portion). He has no dip for bdg. These minima are not very pronounced,
but are completely regular. As the small dip in PM's VOC is later for bdg than
for ptk and may presumably be later than the point of maximum aperture of the
vocal cords in bdg [which, according to Frøkjær-Jensen et al. (1971), is found
at a distance of 45 msec from the implosion (average of the three subjects)], it
might not have anything to do with the opening of the vocal cords; probably it
only has to do with a relaxation of the tension of the vocal cords. This relaxation
is more pronounced for ptk than for bdg. It is thus very improbable that
Danish ptk should have stiffer vocal cords than bdg (cf. Halle and Stevens,
1971).

Some of the INT curves show differences connected with stress. The s of
the frame [han sa:] has a dip, but it is not quite as low as that of ptk. This
might be due to the somewhat weaker stress of this word. This assumption is
supported by the fact that s and f in the words [salə] and [falə], found in HA's
list only, have a dip of the same size as that of his ptk. However, clear
examples of strong and weak stress, like ['pʰanə, pʰa'gai?], do not show any
difference in the minima of the p's. On the other hand, there is a clear difference
between the peaks of the vowels in stressed and unstressed position. EFJ
has 12 pairs of this type, and the difference is found in all of them. Moreover,
there is a corresponding difference between, for example, the first vowel
of [pʰa'gai?] and the second vowel of [ba'kʰan?t]. This double comparison could
be made in eight of the twelve pairs, and there were no exceptions. The difference
due to stress is, however, not as large as the difference due to the preceding
consonant, and it therefore disappears if the consonant of the unstressed
syllable is of the ptk-type and the consonant of the stressed syllable is of the
bdg-type, whereas it is enhanced in the opposite case. For example, the INT peak
for the second vowel of [ba'kʰan?t] is 93 mv higher than that for the first
vowel, whereas the peak for the second vowel of [pʰa'gai?] is 3 mv lower than
that for the first vowel (Figure 7). VOC and LCA do not show any stress differences,
but a clear connection with pitch. They show a rise in the second syllable
of ['pʰanə] as well as in the second syllable of [pʰa'gai?].
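The two INT figures quoted above (93 mv higher, 3 mv lower) can be read as the sum and the difference of a stress effect and a consonant effect. The following sketch assumes, purely for illustration, that the two effects combine additively and are the same in both words; the paper makes no such claim, and the function name is hypothetical.

```python
def split_effects(diff_reinforcing, diff_opposing):
    """Solve S + C = diff_reinforcing and S - C = diff_opposing.

    S -- peak increase attributed to stress
    C -- peak increase attributed to a preceding ptk-type consonant
    Both differences are (second vowel peak) - (first vowel peak).
    """
    stress = (diff_reinforcing + diff_opposing) / 2
    consonant = (diff_reinforcing - diff_opposing) / 2
    return stress, consonant


# [ba'kʰan?t]: stressed syllable with k, so both effects add up to +93.
# [pʰa'gai?]: stressed syllable with g, unstressed with p, so they oppose: -3.
stress, consonant = split_effects(93, -3)
```

Under this additive reading the stress effect comes out at 45 and the consonant effect at 48 (in the units of the figures), which agrees with the statement that the difference due to stress is somewhat smaller than that due to the preceding consonant.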

The results of the present investigation of larynx muscles agree on many
points with earlier findings. The aspirated Danish stops have a dip in INT and
a peak in PCA like the aspirated English stops (Hirose and Gay, 1971); and the
unaspirated voiceless Danish stops behave like unaspirated p as pronounced by
L. Lisker (Hirose, Lisker, and Abramson, 1972): they have a smaller and shorter dip
in INT than aspirated ptk and a lower peak in PCA.

The activity patterns of VOC and LCA seem more complex than what has been
described by one of the present authors (Hirose, 1971b) as being active in vowels
only and suppressed for consonants irrespective of the type of consonants,
although language difference and context effects might well be taken into
consideration. In the present investigation we found that LCA has a dip for ptk
but not for bdg, and that VOC has a deeper dip for ptk than for bdg but none for
l and m. Their activity thus seems to be more differentiated. Similar differences
were found in LCA in a study by Hirose et al. (1972).

In view of the small number of subjects, particularly for the laryngeal
muscles, the results of this study must be considered as preliminary. Further
investigations which are in progress at the Institute of Phonetics in Copenhagen
seem, however, to confirm the results described in this paper.

A Note on Laryngeal Activity in the Danish "stød"
Eli Fischer-Jørgensen and Hajime Hirose



The Danish "stød" is a sort of accent connected with a definite syllable
in the word (historically it is related to accent 1 in Swedish and Norwegian
and in Southern Danish dialects). The stronger forms are characterized by
creaky voice, which is found either at the end of a long vowel ("stød in the
vowel") or at the beginning of a voiced consonant following a short vowel ("stød
in the consonant"). Syllables ending in a short vowel or in a short vowel plus
voiceless consonants cannot have stød. The stød is generally indicated in
phonetic transcriptions by the sign for glottal stop (ʔ) after the vowel or
consonant in question, but in normal standard Danish there is no closure except
in very emphatic speech. The occurrence of the stød is to a large extent
predictable, but there are some minimal pairs distinguished by the presence or
absence of stød, e.g., [le:sɐ] 'reader' vs. [le:ʔsɐ] 'reads,' and [man] 'one,
you' (indefinite pronoun) vs. [manʔ] 'man.'

The most thorough phonetic investigation of the stød was undertaken by
Svend Smith (1944) on the basis of kymograms, oscillograms, pitch curves, and
electromyograms of the respiratory muscles. He describes the stød as "a stress
accent, a special marking movement made by a thrust-like emphasizing of sounds"
(Summary, p. 6), primarily consisting in a brief and intense, rather suddenly
reduced innervation of the respiratory muscles, a sort of ballistic movement,
combined with a more tense articulation of the whole word, which is visible in
the initial consonant. The sudden cessation of innervation of the respiratory
muscles results in a reduction of subglottal pressure, causing a decrease in
intensity and pitch, sometimes ending in irregular oscillations. He does not
find any consistent difference in the pitch movement in the beginning of the
word.

Smith's acoustic description has been confirmed by later studies by
Margaret Lauritsen (1968) and Pia Riber Petersen (1973). Petersen examined
pitch and intensity curves based on tape recordings of six subjects. She found
a very great variability in the phonetic manifestation of the stød, but a
general tendency to a more extensive fall in pitch and intensity in syllables
with stød, sometimes ending in irregular vibrations. This latter phase of the
stød (which Smith calls the second phase) is found at approximately the same
distance from the beginning of the vowel, corresponding to the end of a long
vowel or the beginning of a consonant after a short vowel. The stød is thus a
syllabic phenomenon.

However, nobody has yet tried to verify Smith's physiological description.
He was not able to synchronize the electromyographic recordings with the audio
signal, and it is therefore not quite certain that the activity in the
respiratory muscles precedes the glottal modifications; nor is it known whether
there is an active innervation of the glottis and, if so, whether that
activity is triggered by the respiratory activity or is independent of it.

In connection with the investigation of Danish stop consonants reported in
the preceding report of this volume (Fischer-Jørgensen and Hirose, 1974), some
EMG recordings (using hooked-wire electrodes) were made of the activity of
laryngeal muscles in Danish words with and without stød. For details of the
technique applied, see the preceding report. No comparison with the respiratory
muscles was made. The purpose was only to see whether there was a positive
innervation of the laryngeal muscles in the stød. The subjects, PM, PH, TB, and
EFJ, all have a clear stød. PM, TB, and PH are from Copenhagen; EFJ grew up in
Southern Funen, where the dialect lacks stød, but she has never spoken Funish
dialect. For EFJ a longer list containing words with and without stød was used,
but recordings were made only of the interarytenoid muscle (INT) and the
posterior cricoarytenoid muscle (PCA), and they did not show any difference for
words with and without stød. There is a peak in PCA at the end of the word
[manʔ], which may, however, be due to a more vigorous opening of the glottis at
the end of the word.

The other subjects read the word pairs [le:sɐ, le:ʔsɐ], [pʰi:bɐ, pʰi:ʔbɐ],
and [man, manʔ] in the frame [han sa:] "he said," placed in a randomized list of
words used for the investigation of stop consonants. Each word appeared 16
times.

For subject PH a recording made of the vocalis muscle (VOC) did not show
any differences depending on the stød. This recording, however, was not very
good. In the case of subject TB a difference was found in the activity of VOC
in words with and without stød (see Figure 1). The words with stød showed a
higher degree of activity. It should be mentioned, however, that TB's
pronunciation of the stød was somewhat exaggerated. The words with stød were
pronounced with higher intensity and with higher pitch in the stressed syllable
than the words without stød (this is particularly true of the pair
[pʰi:bɐ/pʰi:ʔbɐ]), and the higher activity of the VOC may be due to the rise in
pitch. TB shows no difference between the words [pʰi'rist] and [pʰa'gaiʔ],
belonging to the consonant list, but a somewhat lower activity in [bu'dist].

The curves of subject PM's recordings are more reliable. The recordings
comprise the vocalis muscle (VOC) and the lateral cricoarytenoid (LCA). The
subject pronounced all words with a rising pitch at the end. This explains the
general rise of the curves (see Figure 2), but apart from this, the words with
stød show a sudden, very clear peak in the beginning of the vowel in all three
pairs. Moreover, there is a definite peak in the second syllable of the words
[pʰa'gaiʔ, ba'kʰanʔt, be'tʰa:ʔla, pʰe'da:ʔl], but hardly any peak in [pʰi'rist]
and [bu'dist]. LCA also shows a slightly higher activity, but this is not very
clear. The initial consonant p shows no difference, either in the inferior
orbicularis oris muscle (OOI) or in the superior orbicularis oris muscle (OOS),
for words with and without stød. Initial m has a slightly higher average peak
in OOS in the words with stød, but the difference is hardly significant.

Thus, for one subject, whose curves are particularly reliable, a difference
in innervation of the vocalis muscle has been found. The investigations are
being continued at the Institute of Phonetics in Copenhagen.